SiteDiff makes it easy to see how a website changes. It can compare two similar sites against each other, or it can show how a single site changed over time. It is a useful tool for conducting QA on re-deployments, site upgrades, and more!
Each time you run SiteDiff, it produces an HTML report showing each requested path, and whether it has changed or not. For changed paths, you can see a colorized diff of the changes, or compare the visual differences side-by-side in a browser.
SiteDiff supports a range of normalization/sanitization rules. These allow you to eliminate spurious differences, narrowing down the differences to the ones that materially affect the site.
- Introduction
- Demo
- Installation
- User's guide
  - Getting started
  - Comparing multiple sites
  - Preventing spurious diffs
  - Getting help
  - Tips & tricks
- Configuration
To quickly see what SiteDiff can do:
git clone https://github.com/evolvingweb/sitediff
cd sitediff
bundle install
bundle exec thor fixture:serve
Then visit http://localhost:13080 to view the report.
Here is an example SiteDiff report:
And here is an example SiteDiff diff of a specific path:
You'll need Ruby 1.9.3 or higher. To speed up installation, we recommend first installing nokogiri and its dependencies manually. The following works on Ubuntu 14.04 and 16.04:
sudo apt-get install -y ruby-dev libz-dev gcc patch make
sudo apt-get install -y libxml2-dev libxslt-dev
sudo gem install nokogiri --no-rdoc --no-ri -- --use-system-libraries=true --with-xml2-include=/usr/include/libxml2
Then install sitediff:
sudo gem install sitediff
To track changes over time using SiteDiff, create a configuration for your site:
sitediff init http://mysite.example.com
SiteDiff will crawl your site, finding pages and caching their contents. You can open the configuration file sitediff/sitediff.yaml to see what SiteDiff found. See the configuration reference for details on the contents of that file, and how you might want to alter it.
Now, you can make alterations to your site. For example, try upgrading any frameworks that your site uses. After you're done, check what actually changed:
sitediff diff
For each page, SiteDiff will report whether it did or did not change. For pages that changed, it will display a diff. If you want a nicer view of the changes, run SiteDiff's web report:
sitediff serve
SiteDiff will start an internal web server and open a report page in your browser. For each page, you can see the diff and a side-by-side view of the old and new versions.
You can now see if the changes were as you expected, or if some things didn't quite work out as you hoped. If you noticed unexpected changes, congratulations: SiteDiff just helped you find a bug you would have missed otherwise!
As you fix any issues, you can continue to alter your site and run sitediff diff to check the changes against the old version. Once you're satisfied with the state of your site, you can inform SiteDiff that it should re-cache your site:
sitediff store
The next time you run sitediff diff, it will use this new version as the baseline for comparison.
Happy diffing!
Sometimes you have two sites that you want to compare, for example a production site hosted on a public server and a development site hosted on your computer. SiteDiff can handle this situation, too! Just inform SiteDiff that there are two sites to compare:
sitediff init http://mysite.example.com http://localhost/mysite
Then when you run sitediff diff, it will compare the cached version of the first site with the current version of the second site.
If both the first and second sites may be changing, you should tell SiteDiff not to cache either site:
sitediff diff --cached=none
Sometimes sites have spurious differences that you don't want to show up in a comparison. For example, many sites protect against Cross-Site Request Forgery using a semi-random token. Since this token changes on each HTTP GET, you probably don't care about such a change.
To help with issues such as this, SiteDiff allows you to normalize the HTML it fetches as it compares pages. In the sitediff.yaml configuration file, you can add "sanitization rules", which specify either DOM transformations or regular expression substitutions.
Here's an example of a rule you might add to remove Django CSRF-protection tokens:
dom_transform:
  - title: Remove CSRF tokens
    type: remove
    selector: input[name=csrfmiddlewaretoken]
When you run sitediff init, SiteDiff will even auto-detect some potentially useful rules, and include them in your configuration file. They start disabled, but you can easily remove the disabled: true line to try them out. Currently only rules useful for common Drupal sites are auto-detected.
See the configuration file reference for more details.
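To make the idea concrete, here is a minimal Ruby sketch of why a per-request token produces a spurious diff and how normalizing it removes the difference. It is not SiteDiff's implementation: it uses a plain regular-expression substitution rather than a DOM transform, and the `normalize_csrf` helper, the sample markup, and the `__csrf__` placeholder are all assumptions made for illustration.

```ruby
# Hypothetical pattern: captures everything around the value of Django's
# CSRF input so that only the token itself is replaced.
CSRF_PATTERN = /(<input[^>]*name="csrfmiddlewaretoken"[^>]*value=")[^"]*(")/

# Hypothetical helper: swap the token for a fixed placeholder.
def normalize_csrf(html)
  html.gsub(CSRF_PATTERN) { "#{$1}__csrf__#{$2}" }
end

# Two fetches of the same page that differ only in the token (sample data).
FIRST_FETCH  = '<form><input type="hidden" name="csrfmiddlewaretoken" value="abc123"></form>'
SECOND_FETCH = '<form><input type="hidden" name="csrfmiddlewaretoken" value="zyx987"></form>'

# The raw HTML differs, so a naive comparison would report a change.
puts FIRST_FETCH == SECOND_FETCH
# After normalization both fetches are identical.
puts normalize_csrf(FIRST_FETCH) == normalize_csrf(SECOND_FETCH)
```

Removing the element entirely, as the dom_transform rule above does, has the same effect on the comparison; the substitution here is only the simplest thing to show without an HTML parser.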
SiteDiff has built-in help! To see a list of commands:
sitediff help
To get help on the options for a particular command, e.g. diff:
sitediff help diff
- Finding configuration files
  SiteDiff will attempt to auto-detect any nearby sitediff.yaml files. But if your sitediff.yaml is in an unusual place, you can force SiteDiff to change to that directory:
      sitediff diff -C some-directory
  Or you can even specify a list of config files at the command-line:
      sitediff diff myconfig.yaml otherconfig.yaml ...
- Handling large configuration files
  If your configuration file starts getting really big, SiteDiff lets you separate it out into multiple files. Just have one base file that includes other files:
      includes:
        - sanitization.yaml
        - paths.yaml
  This allows you to separate your configuration into logical groups. For example, generic rules for your site could live in a generic.yaml file, while rules pertaining to a particular update you're conducting could live in update-8.2.yaml.
- Specifying paths
  When you run sitediff diff, you can specify which pages to look at in several ways:
  - The paths key in your configuration file.
  - The option --paths /foo /bar ...
    If you're trying to fix one page in particular, specifying just that one path will make sitediff diff run quickly!
  - The option --paths-file FILE with a newline-delimited text file.
    This is particularly useful when you're trying to eliminate all diffs. SiteDiff creates a file output/failures.txt containing all paths which had differences, so as you try to fix differences, you can run:
        sitediff diff --paths-file output/failures.txt
- Debugging rules
  When a sanitization rule isn't working quite right for you, you might run sitediff diff many times over. If fetching all the pages is taking too long, try adding the option --cached=all. This tells SiteDiff not to re-fetch the content, but just to compare the previously cached versions, which is a lot faster!
- Handling security
  Often development or staging sites are protected by HTTP Authentication. SiteDiff allows you to specify a username and password by using a URL like http://user:pass@example.com.
- Running inside containers
  If you run SiteDiff inside a container or virtual machine, URLs in its report, such as localhost, might not work from your host. You can fix this by using the --before-url-report and --after-url-report options to tell SiteDiff to use a different URL in the report than the one it uses for fetching.
  For example, if you ran sitediff init http://mysite.com http://localhost inside a Vagrant VM, you might then run something like:
      sitediff diff --after-url-report=http://vagrant:8080
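The report-URL options in the last tip amount to building two URLs for each path: one used for fetching and one shown in the report. The Ruby sketch below illustrates only that idea. `page_urls` is a hypothetical helper, not part of SiteDiff, and the `/about` path is an assumption; the host names come from the example in the tip.

```ruby
# Hypothetical helper: build the URL used to fetch a page and the URL shown
# in the report. When no report base is given, the fetch base is reused.
def page_urls(fetch_base, path, report_base = nil)
  join = ->(base) { "#{base.chomp('/')}/#{path.sub(%r{\A/}, '')}" }
  { fetch: join.call(fetch_base), report: join.call(report_base || fetch_base) }
end

urls = page_urls('http://localhost', '/about', 'http://vagrant:8080')
puts urls[:fetch]   # the URL fetched from inside the VM
puts urls[:report]  # the URL linked from the report, reachable from the host
```

Keeping the two bases separate means the comparison itself is unaffected; only the links a reader clicks in the report change.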
SiteDiff relies on a YAML configuration file, usually called sitediff.yaml. You can create a reasonable one using sitediff init, but there are many useful things you may want to manually add or change.
The following sitediff.yaml keys are recognized by SiteDiff:
- before_url and after_url: The two base URLs to compare, for example:
      before_url: http://example.com/subsite
      after_url: http://localhost:8080/subsite
  They can also be paths to directories on the local filesystem.
  The after_url MUST be provided either at the command-line or in the sitediff.yaml. If the before_url is provided, SiteDiff will compare the two sites. Otherwise, it will compare the current version of the 'after' site with the stored version of that site, as created by sitediff init or sitediff store.
- paths: The list of paths to check, rooted at the base URL. For example:
      paths:
        - index.html
        - page.html
        - cgi-bin/test.cgi?param=value
  In the example above, SiteDiff would compare http://example.com/subsite/index.html and http://localhost:8080/subsite/index.html, followed by page.html, and so on.
  The paths MUST be provided either at the command-line or in the sitediff.yaml file.
- selector: Chooses the sections of HTML we wish to compare, if you don't want to compare the entire page. For example, if you want to compare only the breadcrumbs between your two sites, you might specify:
      selector: '#breadcrumb'
- before_url_report and after_url_report: Change how SiteDiff reports which URLs it is comparing, but do not change what it actually compares.
  Suppose you are serving your 'after' website on a virtual machine with IP 192.1.2.3, and you are also running SiteDiff inside that VM. To make links in the report accessible from outside the VM, you might provide:
      after_url: http://localhost
      after_url_report: http://192.1.2.3
- sanitization: A list of regular expression rules to normalize your HTML for comparison.
  Each rule should have a pattern regex, which is used to search the HTML. Each found instance is replaced with the provided substitute, or deleted if no substitute is provided. A rule may also have a selector, which constrains it to operate only on HTML fragments which match that CSS selector.
  For example, forms on Drupal sites have a randomly generated form_build_id on form pages:
      <input type="hidden" name="form_build_id" value="form-1cac6b5b6141a72b2382928249605fb1"/>
  We're not interested in comparing random content, so we could use the following rule to fix this:
      sanitization:
        # Remove form build IDs
        - pattern: '<input type="hidden" name="form_build_id" value="form-[a-zA-Z0-9_-]+" *\/?>'
          selector: 'input'
          substitute: '<input type="hidden" name="form_build_id" value="__form_build_id__">'
  Sanitization rules may also have a path attribute, whose value is a regular expression. If present, the rule will only apply to matching paths.
- dom_transform: A list of transformations to apply to the HTML before comparing.
  This is similar to sanitization, but it applies transformations to the structure of the HTML, instead of to the text. Each transformation has a type, and potentially other attributes. The following types are available:
  - remove: Given a selector, removes all elements that match it.
    For example, say we have a block containing the current time, which is expected to change. To ignore that, we might choose to delete the block before comparison:
        dom_transform:
          # Remove current time block
          - type: remove
            selector: div#block-time
  - unwrap: Given a selector, replaces all matching elements with their children. For example, your content on one side of the comparison might look like this:
        <p>This is some text</p>
        <img src="test.png"/>
    But on the other side, it might be wrapped in an article tag:
        <article>
          <p>This is some text</p>
          <img src="test.png"/>
        </article>
    You could fix it with the following configuration:
        dom_transform:
          - type: unwrap
            selector: article
  - remove_class: Given a selector and a class, removes that class from each element that matches the selector. It can also take a list of classes, instead of just one.
    For example, here are two sample rules for removing a single class and removing multiple classes from all div elements:
        dom_transform:
          # Remove class foo from div elements
          - type: remove_class
            selector: div
            class: class-foo
          # Remove class bar and class baz from div elements
          - type: remove_class
            selector: div
            class:
              - class-bar
              - class-baz
  - unwrap_root: Replaces the entire root element with its children.
- before and after: Applies rules to just one side of the comparison.
  These blocks can contain any of the following sections: selector, sanitization, dom_transform. Such a section placed in before will be applied just to the before side of the comparison, and similarly for after.
  For example, if you wanted to keep different date formatting from creating diff failures, you might use the following:
      before:
        sanitization:
          - pattern: '[1-2][0-9]{3}/[0-1][0-9]/[0-9]{2}'
            substitute: '__date__'
      after:
        sanitization:
          - pattern: '[A-Z][a-z]{2} [0-9]{1,2}(st|nd|rd|th) [1-2][0-9]{3}'
            substitute: '__date__'
  The above rule will replace dates of the form 2004/12/05 in before and dates of the form May 12th 2004 in after with __date__.
- includes: The names of other configuration YAML files to merge with this one.
      includes:
        - config/sanitize_domains.yaml
        - config/strip_css_js.yaml
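As a rough illustration of how a regular-expression sanitization rule behaves, the Ruby sketch below applies a rule with a pattern, a substitute, and an optional path to the HTML of one page. It follows the description of the sanitization key but is not SiteDiff's code: `apply_rule`, the hash layout, and the `^user/` path regex are assumptions, and CSS selector scoping is left out because it would need an HTML parser.

```ruby
# Rule modelled on the form_build_id example; the path value is hypothetical.
RULE = {
  pattern: '<input type="hidden" name="form_build_id" value="form-[a-zA-Z0-9_-]+" *\/?>',
  substitute: '<input type="hidden" name="form_build_id" value="__form_build_id__">',
  path: '^user/' # optional: only apply the rule to paths matching this regex
}.freeze

SAMPLE_HTML = '<input type="hidden" name="form_build_id" ' \
              'value="form-1cac6b5b6141a72b2382928249605fb1"/>'

# Apply one rule to the HTML fetched for a given path. Matches are replaced
# with the substitute, or deleted when the rule has no substitute.
def apply_rule(rule, path, html)
  return html if rule[:path] && path !~ Regexp.new(rule[:path])

  html.gsub(Regexp.new(rule[:pattern]), rule[:substitute] || '')
end

puts apply_rule(RULE, 'user/login', SAMPLE_HTML) # random ID replaced by the placeholder
puts apply_rule(RULE, 'node/1', SAMPLE_HTML)     # path does not match, HTML unchanged
```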
The config directory contains some example sitediff.yaml files, for example sitediff.yaml.example.
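The date patterns from the before and after example can be checked in isolation before adding them to a configuration file. This Ruby sketch applies each side's pattern to sample text and compares the results. The two patterns are copied from that example; the sample sentences and constant names are assumptions for illustration.

```ruby
# Patterns copied from the before/after example configuration.
BEFORE_DATE = %r{[1-2][0-9]{3}/[0-1][0-9]/[0-9]{2}}
AFTER_DATE  = /[A-Z][a-z]{2} [0-9]{1,2}(st|nd|rd|th) [1-2][0-9]{3}/

# Sample text for each side of the comparison (assumed data).
BEFORE_TEXT = 'Posted on 2004/12/05 by admin'
AFTER_TEXT  = 'Posted on Dec 5th 2004 by admin'

# Replace each side's date format with the shared placeholder.
before_clean = BEFORE_TEXT.gsub(BEFORE_DATE, '__date__')
after_clean  = AFTER_TEXT.gsub(AFTER_DATE, '__date__')

puts before_clean
puts after_clean
puts before_clean == after_clean # both sides now contain the same text
```

Note that the after pattern only matches three-letter month names followed by an ordinal day, so other date formats would need their own rule.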