BlogSearch index building tool for generic static webpages (crawler)

🔥	This is part of the BlogSearch project. For the overall concept, see the parent directory.

1. Building a search index file

The easiest way

npx blogsearch-crawler

The formal way

npm install -g blogsearch-crawler
blogsearch-crawler

Configuration

🔥	For more details on how to configure fields, go to the "What’s in the index file" section of the main project.

blogsearch.config.js

module.exports = {
  // [Mandatory] This must be 'simple'.
  type: 'simple',
  // [Mandatory] Generated blogsearch database file.
  output: './my_blog_index.db.wasm',
  // [Mandatory] List of entries to parse. The crawler uses glob patterns internally.
  // How to use glob: https://github.com/isaacs/node-glob
  entries: [
    './reactjs.org/public/docs/**/*.html'
  ],
  // [Mandatory] Fields configurations.
  // See: https://github.com/kbumsik/blogsearch#whats-in-the-index
  fields: {
    title: {
      // The value can be a CSS selector.
      parser: 'article > header',
    },
    body: {
      // Set false if you want to reduce the size of the database.
      hasContent: true,
      // It can be a function as well.
      parser: (entry, page) => {
        // Use puppeteer page object.
        // It's okay to return a promise.
        return page.$eval('article > div > div:first-child', el => el.textContent);
      }
    },
    url: {
      // Setting this to false means the search engine won't index the URL.
      indexed: false,
      parser: (entry, page) => {
        // entry is a string of the path being parsed.
        return entry.replace('./reactjs.org/public', 'https://reactjs.org');
      },
    },
    categories: {
      // This is disabled because the target website doesn't have categories.
      enabled: false,
      // This is a dummy parser. This is unused because the field is disabled.
      parser: () => 'categories-1, categories-2',
    },
    tags: {
      // This is disabled because the target website doesn't have tags.
      enabled: false,
      // This is a dummy parser. This is unused because the field is disabled.
      parser: () => 'tags-1, tags-2',
    },
  }
};
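
Because the function parsers in the configuration above are ordinary JavaScript functions, they can be tried out before running the crawler. The following is a minimal sketch, not part of blogsearch-crawler: `fakePage` is a hypothetical stand-in that stubs only the `$eval` method of a puppeteer page, the sample entry path and the text `'Introducing Hooks'` are made up for illustration, and `bodyParser` and `urlParser` are just local names for the parsers copied from the config.

```javascript
// Hypothetical stand-in for the puppeteer page object (an assumption for
// illustration). Only $eval is stubbed: it ignores the selector and applies
// the callback to a fake element.
const fakePage = {
  $eval: async (selector, fn) => fn({ textContent: 'Introducing Hooks' }),
};

// The two function parsers from the config above, copied out under local names.
const bodyParser = (entry, page) =>
  page.$eval('article > div > div:first-child', el => el.textContent);
const urlParser = (entry, page) =>
  entry.replace('./reactjs.org/public', 'https://reactjs.org');

async function main() {
  // Made-up entry path in the shape the `entries` glob would match.
  const entry = './reactjs.org/public/docs/hooks-intro.html';
  // `await` accepts both a plain return value and a promise.
  const body = await bodyParser(entry, fakePage);
  const url = await urlParser(entry, fakePage);
  console.log({ body, url });
  return { body, url };
}

main();
```

Checking the `url` parser this way is a quick means of confirming that the prefix being replaced matches the prefix used in `entries`; if the two differ, `replace` returns the path unchanged.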

2. Enabling the search engine in the webpage

Once the index file is built, enable the search engine in your web page. Go to blogsearch Engine.

Again, if you would like to understand the concept of BlogSearch, go to the parent directory.
