crawls a website and creates a sitemap of internal links using d3.js tree

ptutty/sitemapcreator

Tree view site map generator

Discovers all the pages in a site or single page app (SPA) and writes the resulting tree to ./output/<site slug>/crawl.json. Optionally takes a screenshot of each page as it is visited.

Demo

Demo website

Prerequisites

  • Node v8+

Install Node using Homebrew on Mac

Download project and install dependencies

git clone https://github.com/ptutty/sitemapcreator
cd sitemapcreator
npm install

Setup Configuration file

  • edit config-sample.json, rename it to config.json, and add the URL details (host and path) of the site you wish to crawl
  • depth is how many levels of links to crawl
  • when testing, you may find it useful to set headless: false to see what is going on
  • the filters flag lets you customize which anchor links are crawled
{
    "host": "https://www.bbc.co.uk",
    "path": "/sport",
    "depth": 2,
    "headless": true,
    "filters": false
}
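
Before starting a long crawl it can help to sanity-check the config. The sketch below is not part of the project: validateConfig is a hypothetical helper, and its rules (host begins with http:// or https://, path begins with /, depth is a non-negative integer, headless and filters are booleans) are assumptions inferred from the sample above rather than documented requirements.

```javascript
// Hypothetical helper, not part of sitemapcreator.
// The rules are assumptions inferred from the sample config.
function validateConfig(config) {
  const errors = [];
  if (typeof config.host !== "string" || !/^https?:\/\//.test(config.host)) {
    errors.push("host must be a URL starting with http:// or https://");
  }
  if (typeof config.path !== "string" || !config.path.startsWith("/")) {
    errors.push("path must be a string starting with /");
  }
  if (!Number.isInteger(config.depth) || config.depth < 0) {
    errors.push("depth must be a non-negative integer");
  }
  for (const flag of ["headless", "filters"]) {
    if (typeof config[flag] !== "boolean") {
      errors.push(flag + " must be true or false");
    }
  }
  return errors;
}

// The sample config from this README, parsed as JSON.
const sampleConfig = JSON.parse(`{
    "host": "https://www.bbc.co.uk",
    "path": "/sport",
    "depth": 2,
    "headless": true,
    "filters": false
}`);

// An empty list means no problems were found.
console.log(validateConfig(sampleConfig));
```

In practice you would read config.json from disk and pass the parsed object to the same function.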

Filters

Filters allow you to remove unwanted cruft from the visualisation, such as page anchor links, links back to the homepage, links to documents, intranet links, etc. See the array 'excludeAnchorsWhichContain' below.

Sometimes you may not wish to crawl the navigation again on each subpage. To avoid this, list URL fragments in the array 'excludeSubpageAnchorsEndingWith'.

{ 
    "excludeSubpageAnchorsEndingWith" : [
        "/live/",
        "/programmes/"
    ],
    "excludeAnchorsWhichContain" : [
        "#",
        ".pdf",
        "docx",
        "doc"
    ]
}
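
To make the two arrays concrete, here is a minimal sketch of how such filters could be applied to the links found on a page. It is not the project's implementation. filterAnchors and its isSubpage parameter are hypothetical, and the matching rules (substring match for 'excludeAnchorsWhichContain', suffix match on subpages only for 'excludeSubpageAnchorsEndingWith') are assumptions based on the array names.

```javascript
// Hypothetical sketch, not the project's code.
// Assumed semantics, inferred from the array names:
//  - excludeAnchorsWhichContain: drop a link whose URL contains any fragment
//  - excludeSubpageAnchorsEndingWith: on subpages only, drop a link whose
//    URL ends with any fragment
function filterAnchors(urls, filters, isSubpage) {
  const contains = filters.excludeAnchorsWhichContain || [];
  const endings = filters.excludeSubpageAnchorsEndingWith || [];
  return urls.filter((url) => {
    if (contains.some((fragment) => url.includes(fragment))) return false;
    if (isSubpage && endings.some((fragment) => url.endsWith(fragment))) return false;
    return true;
  });
}

const exampleFilters = {
  excludeSubpageAnchorsEndingWith: ["/live/", "/programmes/"],
  excludeAnchorsWhichContain: ["#", ".pdf", "docx", "doc"],
};

// Made-up links for illustration.
const exampleLinks = ["/sport/football", "/sport/live/", "/sport#top", "/sport/guide.pdf"];

console.log(filterAnchors(exampleLinks, exampleFilters, false));
// -> [ '/sport/football', '/sport/live/' ]
console.log(filterAnchors(exampleLinks, exampleFilters, true));
// -> [ '/sport/football' ]
```

Under these assumed rules a short fragment such as "doc" would also exclude a URL like /documentation, so more specific fragments (for example ".doc") may be safer.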

Start a crawl and capture data

To start a crawl, run the command below in the console from the project directory.

  node app.js

You will see URLs being crawled in the console. You can also run a crawl that captures a screenshot of each page:

  node app.js --screenshots

View the visualisation

Start a local server.

  node server.js

Then open the URL below in a browser:

http://localhost:8080/html/d3tree.html?url=../output/https___yourspa.com/crawl.json
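
The folder name in that URL appears to be the site host with ':' and '/' replaced by underscores (https://yourspa.com becomes https___yourspa.com). Assuming that pattern holds, the sketch below builds the viewer URL for a given host. viewerUrl is a hypothetical helper and the slug rule is a guess from the single example above, so check the ./output directory for the actual folder name.

```javascript
// Hypothetical helper, not part of sitemapcreator.
// Assumption: the output folder is the host with ':' and '/' replaced by '_',
// as the example URL in this README suggests.
function viewerUrl(host, port = 8080) {
  const slug = host.replace(/[:/]/g, "_");
  return "http://localhost:" + port + "/html/d3tree.html?url=../output/" + slug + "/crawl.json";
}

console.log(viewerUrl("https://www.bbc.co.uk"));
// -> http://localhost:8080/html/d3tree.html?url=../output/https___www.bbc.co.uk/crawl.json
```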
