@mikandev/ssg-parser

A utility package for scraping and processing website content, with AI-powered text extraction and preprocessing capabilities.

Installation

npm install @mikandev/ssg-parser

Features

Website scraping and content extraction
Advanced text cleaning and preprocessing
Script tag removal from HTML content
AI-powered text parsing with Gemini or Jina
Conversion of HTML content to clean markdown

Usage

Basic Compilation with Jina

import { compile } from '@mikandev/ssg-parser';

async function main() {
  const baseUrl = 'https://example.com';
  const jinaApiKey = 'YOUR_JINA_API_KEY'; // replace with your Jina Reader API key

  try {
    const compiledText = await compile(baseUrl, jinaApiKey);
    console.log(compiledText);
  } catch (error) {
    console.error('Error:', error);
  }
}

main();

Using Gemini for Compilation

import { geminiCompile } from '@mikandev/ssg-parser';

async function main() {
  // Requires the GOOGLE_GENERATIVE_AI_API_KEY environment variable to be set
  const baseUrl = 'https://example.com';
  
  try {
    const compiledText = await geminiCompile(baseUrl);
    console.log(compiledText);
  } catch (error) {
    console.error('Error:', error);
  }
}

main();

Preprocessing Pages

import { preprocessPages } from '@mikandev/ssg-parser';

async function main() {
  const baseUrl = 'https://example.com';
  
  try {
    const processedPages = await preprocessPages(baseUrl);
    
    for (const [url, html] of processedPages) {
      console.log(`Processed ${url}, content length: ${html.length}`);
    }
  } catch (error) {
    console.error('Error:', error);
  }
}

main();

API Reference

`compile(baseUrl: string, apiKey: string): Promise<string>`

Compiles parsed text from all internal links of a given base URL using Jina.

baseUrl: The starting URL to scrape
apiKey: API key for the Jina Reader API

Returns a promise that resolves to the compiled text.
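
One plausible reading of "internal links" above is links that resolve to the same origin as baseUrl. The sketch below illustrates that idea with the standard URL API. The helpers `isInternalLink` and `collectInternalLinks` are hypothetical names used only for illustration; they are not exports of this package, and the package may decide what counts as internal differently.

```typescript
// Illustrative only: these helpers are assumptions, not part of @mikandev/ssg-parser.

/** Returns true when `href`, resolved against `baseUrl`, stays on the same origin. */
function isInternalLink(href: string, baseUrl: string): boolean {
  try {
    return new URL(href, baseUrl).origin === new URL(baseUrl).origin;
  } catch {
    return false; // hrefs that cannot be parsed are treated as external
  }
}

/** Resolves hrefs to absolute URLs, keeps same-origin ones, and drops fragments and duplicates. */
function collectInternalLinks(hrefs: string[], baseUrl: string): string[] {
  const seen = new Set<string>();
  for (const href of hrefs) {
    if (!isInternalLink(href, baseUrl)) continue;
    const resolved = new URL(href, baseUrl);
    resolved.hash = ''; // "#section" anchors point at the same page
    seen.add(resolved.href);
  }
  return Array.from(seen);
}

const links = collectInternalLinks(
  ['/about', 'https://example.com/blog#top', 'https://other.example.org/', 'mailto:hi@example.com'],
  'https://example.com',
);
console.log(links);
```

Relative paths are resolved against the base URL before comparison, so `/about` counts as internal, while the other-origin URL and the `mailto:` link are filtered out.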

`geminiCompile(baseUrl: string): Promise<string>`

Compiles parsed text from all internal links of a given base URL using Gemini.

baseUrl: The starting URL to scrape
Requires a Google AI Studio API key, set in the `GOOGLE_GENERATIVE_AI_API_KEY` environment variable

Returns a promise that resolves to the compiled text.
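
Because geminiCompile reads its key from the environment rather than from an argument, it can help to check for the variable up front and fail with a clear message. The following is a minimal sketch of such a guard. `requireEnv` and the `Env` type are hypothetical names, not part of this package, and the environment is passed in explicitly so the helper does not depend on a particular runtime.

```typescript
// Hypothetical helper, not part of @mikandev/ssg-parser.
type Env = Record<string, string | undefined>;

/** Returns the value of `name` from `env`, or throws if it is missing or blank. */
function requireEnv(name: string, env: Env): string {
  const value = env[name];
  if (value === undefined || value.trim() === '') {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// In a Node.js or Bun script, call this before geminiCompile:
//   requireEnv('GOOGLE_GENERATIVE_AI_API_KEY', process.env);
```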

`preprocessPages(url: string, processedPages?, visited?): Promise<Map<string, string>>`

Scrapes a site and preprocesses pages by removing script tags and cleaning text.

url: The starting URL to scrape
processedPages: Optional Map from URL to processed HTML content, used to accumulate results
visited: Optional Set of already visited URLs

Returns a promise resolving to a map of URLs to processed HTML content.
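
The optional processedPages and visited parameters suggest a recursive crawl that threads its accumulators through each call. The sketch below shows that general pattern under stated assumptions; it is not the package's actual implementation. `crawl`, `stripScripts`, and `FetchPage` are hypothetical names, the page fetcher is injected so the example runs without network access, and script removal uses a simple regular expression where real code would more likely use an HTML parser.

```typescript
// Illustrative sketch only; none of these names are exports of @mikandev/ssg-parser.

type Page = { html: string; links: string[] };
type FetchPage = (url: string) => Promise<Page>;

/** Removes <script>...</script> blocks. A regex is enough for a demo, not for arbitrary HTML. */
function stripScripts(html: string): string {
  return html.replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, '');
}

/** Visits `url` and every page reachable from it, storing cleaned HTML per URL. */
async function crawl(
  url: string,
  fetchPage: FetchPage,
  processedPages: Map<string, string> = new Map(),
  visited: Set<string> = new Set(),
): Promise<Map<string, string>> {
  if (visited.has(url)) return processedPages; // already handled; this also stops cycles
  visited.add(url);

  const { html, links } = await fetchPage(url);
  processedPages.set(url, stripScripts(html));

  for (const link of links) {
    await crawl(link, fetchPage, processedPages, visited);
  }
  return processedPages;
}

// In-memory "site" with a cycle between the two pages, standing in for real HTTP requests.
const site: Record<string, Page> = {
  'https://example.com/': {
    html: '<p>Home</p><script>track()</script>',
    links: ['https://example.com/about'],
  },
  'https://example.com/about': {
    html: '<p>About</p>',
    links: ['https://example.com/'],
  },
};

const pagesPromise = crawl('https://example.com/', async (url) => site[url]);
pagesPromise.then((pages) => {
  pages.forEach((html, url) => console.log(url, html));
});
```

Sharing one Map and one Set across the recursion means each URL is fetched at most once, and a caller can pass in pre-populated collections to resume or extend an earlier crawl.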

License

See the repository for license information.

Repository

GitHub repository: mikndotdev/ssg-parser
