diff --git a/content/docs/overview/guides/aeo.mdx b/content/docs/overview/guides/aeo.mdx new file mode 100644 index 00000000..32976d71 --- /dev/null +++ b/content/docs/overview/guides/aeo.mdx @@ -0,0 +1,6 @@ +--- +title: Build an AEO Scraper (Node) +description: Scrape LLM providers with Steel and synthesize answers with OpenAI +sidebarTitle: AEO Scraper (Node) +llm: true +--- diff --git a/content/docs/overview/guides/meta.json b/content/docs/overview/guides/meta.json index 9e2294ad..a84d93ad 100644 --- a/content/docs/overview/guides/meta.json +++ b/content/docs/overview/guides/meta.json @@ -7,6 +7,7 @@ "playwright-node", "playwright-python", "puppeteer", - "selenium" + "selenium", + "perplexity" ] } diff --git a/content/docs/overview/guides/perplexity.mdx b/content/docs/overview/guides/perplexity.mdx new file mode 100644 index 00000000..876c4e68 --- /dev/null +++ b/content/docs/overview/guides/perplexity.mdx @@ -0,0 +1,469 @@ +--- +title: Build a Perplexity‑style Search Engine +description: Search with Brave, scrape with Steel, and synthesize with OpenAI using a TypeScript CLI +sidebarTitle: Perplexity Clone (Node) +llm: true +--- + +This guide shows you how to build a Perplexity-like research workflow in Node.js/TypeScript that: +- Finds relevant links with the Brave Search API +- Scrapes those links to Markdown via Steel’s /v1/scrape endpoint +- Synthesizes a well-cited answer with inline citations + +Looking for a ready-made starter? Skip to the example project section. + +Quick Start +----------- + +Clone the example and run it locally: + +```bash +git clone https://github.com/steel-dev/steel-cookbook +cd steel-cookbook/examples/steel-perplexity-clone +npm install + +# Create a .env file in this directory with your credentials +# See "Configuration" below for required variables. + +# Option A: put QUERY in .env +npm start + +# Option B: pass QUERY on the fly +QUERY="What are the latest improvements in WebAssembly?" 
npm start +``` + +- Node.js: Requires Node 18+ +- Credentials: You’ll need API keys for Steel.dev, OpenAI, and Brave Search. + +Project Structure +----------------- + +```bash +examples/steel-perplexity-clone + ├─ src/ + │ ├─ config.ts # Env parsing, defaults, feature flags + │ ├─ clients.ts # Brave search, Steel scrape, OpenAI synthesis + │ └─ index.ts # Main pipeline orchestration + ├─ package.json + ├─ tsconfig.json + └─ README.md +``` + +Configuration +------------- + +Create a `.env` file in `examples/steel-perplexity-clone`: + +```env +NODE_ENV=development + +# OpenAI +OPENAI_API_KEY=sk-... +OPENAI_ORG_ID= +OPENAI_MODEL=gpt-5-nano + +# Steel.dev +STEEL_API_KEY=steel_... + +# Brave Search +BRAVE_API_KEY=brv_... +BRAVE_SEARCH_ENDPOINT=https://api.search.brave.com/res/v1/web/search +BRAVE_SEARCH_COUNTRY=US +BRAVE_SEARCH_LANG=en +BRAVE_SAFESEARCH=moderate + +# Search behavior +SEARCH_TOP_K=10 +REQUEST_TIMEOUT_MS=5000 +CONCURRENCY=5 + +# Your question to research +QUERY="What are the latest improvements in WebAssembly and their benefits?" +``` + +What this example does +---------------------- + +At a high level: + +1. Search Brave for relevant URLs + +2. Scrape sources to Markdown with Steel + - Sends each URL to Steel’s `/v1/scrape` to obtain clean Markdown + +3. 
Synthesize a well‑cited answer with OpenAI
   - Builds a context block from the scraped Markdown
   - Instructs the model to produce inline [n] citations that match the material order

The core orchestration happens here:

```typescript Typescript -wcn -f index.ts
import { config } from "./config";
import {
  scrapeUrlsToMarkdown,
  synthesizeWithCitations,
  singleQueryBraveSearch,
} from "./clients";

type SearchResponse = {
  query: string;
  answer: string;
  citations: Array<{ index: number; url: string }>;
  model: string;
  meta: {
    tookMs: number;
  };
};

async function main() {
  const started = Date.now();

  const query = config.query;
  const topK = config.search.topK;
  const concurrency = config.concurrency;

  console.info("Search request received", {
    query,
    topK,
  });

  // 1) Use Brave to get the top relevant URLs (request double topK so scraping has spare candidates)
  const { urls } = await singleQueryBraveSearch(query, topK * 2);

  if (urls.length === 0) {
    console.error("No URLs found for the given query.");
    return;
  }

  // 2) Scrape each URL into Markdown using Steel.dev
  const materials = await scrapeUrlsToMarkdown(urls, concurrency, topK);

  if (materials.length === 0) {
    console.error("Failed to scrape any URLs. 
Try again or refine your query.");
    return;
  }

  // 3) Use OpenAI to synthesize an answer with inline citations
  const synthesis = await synthesizeWithCitations({
    query,
    materials,
  });

  const tookMs = Date.now() - started;

  const response: SearchResponse = {
    query,
    answer: synthesis.answer,
    citations: synthesis.sources,
    model: config.openai.model,
    meta: { tookMs },
  };

  console.log(response);
}

// Execute the demo
main()
  .then(() => {
    process.exit(0);
  })
  .catch((error) => {
    console.error("Task execution failed:", error);
    process.exit(1);
  });
```

Step 1: Get relevant URLs
---------------------------------------

- The example calls the Brave Search API to retrieve relevant URLs for the user query

```typescript Typescript -wcn
export async function singleQueryBraveSearch(
  userQuery: string,
  topKPerQuery = config.search.topK,
): Promise<{ queries: string[]; urls: string[]; _raw: unknown }> {
  const spinner = ora("Searching...").start();
  const normalizedQuery = userQuery.trim() || userQuery;
  const queries = [normalizedQuery];

  try {
    const { urls } = await searchTopRelevantUrls(
      normalizedQuery,
      topKPerQuery ?? 
config.search.topK,
    );

    spinner.succeed("Search complete");

    return {
      queries,
      urls,
      _raw: { perQueryUrls: [urls] },
    };
  } catch (err) {
    spinner.fail("Search failed");
    console.warn("Brave search failed for query", {
      query: normalizedQuery,
      err: (err as Error)?.message,
    });

    return {
      queries,
      urls: [],
      _raw: { error: err },
    };
  }
}
```

Under the hood, the Brave call itself looks like this:

```typescript Typescript -wcn
export async function searchTopRelevantUrls(
  query: string,
  topK = config.search.topK,
): Promise<{ urls: string[] }> {
  // Build Brave Search request URL with query params
  const endpoint = new URL(config.brave.endpoint);
  endpoint.searchParams.set("q", query);
  endpoint.searchParams.set("country", config.brave.country);
  endpoint.searchParams.set("search_lang", config.brave.lang);
  endpoint.searchParams.set("safesearch", config.brave.safesearch);
  // Brave caps "count" at 20 results per request
  endpoint.searchParams.set("count", String(Math.min(topK, 20)));

  const res = await fetchWithTimeout(endpoint.toString(), {
    headers: {
      Accept: "application/json",
      "X-Subscription-Token": config.brave.apiKey,
    },
  });

  if (!res.ok) {
    const text = await res.text().catch(() => "");
    console.error("Brave search failed", {
      status: res.status,
      statusText: res.statusText,
      response: text?.slice(0, 1000),
    });
    throw new Error(`Brave search failed: ${res.status} ${res.statusText}`);
  }

  const data = (await res.json()) as any;

  // Extract URLs from Brave response
  const urls: string[] = [];
  if (data?.web?.results && Array.isArray(data.web.results)) {
    for (const r of data.web.results) {
      if (typeof r?.url === "string") urls.push(r.url);
    }
  } else if (Array.isArray(data?.results)) {
    for (const r of data.results) {
      if (typeof r?.url === "string") urls.push(r.url);
    }
  }

  // Fallback: salvage any URLs from the raw response body
  if (urls.length === 0) {
    const rawText = JSON.stringify(data);
    const regex = /\bhttps?:\/\/[^\s"'<>]+/gi;
    const salvaged = 
(rawText.match(regex) ?? []) as string[];
    urls.push(...salvaged);
  }

  // Normalize and dedupe
  const normalized = Array.from(new Set(urls.map((u) => u.trim())))
    .filter(Boolean)
    .slice(0, topK);

  return {
    urls: normalized,
  };
}
```

Step 2: Scrape each URL to Markdown with Steel
----------------------------------------------

- For each URL, make a request to Steel's `/v1/scrape` endpoint.
- Request Markdown by setting `format: ["markdown"]`.
- The response contains `content.markdown` along with page metadata.
- The pipeline's `scrapeUrlsToMarkdown` fans this helper out across all URLs, bounded by `CONCURRENCY`.

```typescript Typescript -wcn
export async function scrapeUrlToMarkdown(
  url: string,
): Promise<{ url: string; markdown: string; links?: unknown } | null> {
  try {
    const client = new Steel({
      steelAPIKey: config.steel.apiKey,
      timeout: config.requestTimeoutMs,
    });

    const res = await client.scrape({
      url,
      format: ["markdown"],
    });

    const markdown = res?.content?.markdown;
    const links = res?.links;

    if (!markdown) {
      throw new Error(`Steel.dev response missing markdown content for ${url}`);
    }

    return { url, markdown, links };
  } catch {
    // Treat scrape failures as soft errors; the caller filters out nulls
    return null;
  }
}
```

Step 3: Synthesize an answer with inline citations
--------------------------------------------------

- Build a context that enumerates materials like `[1] URL`, then the Markdown.
- Prompt the model to cite with `[n]` as it writes.
- Return an answer plus a `sources` array mapping `[n] -> url`. 
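The numbering convention is easiest to see in isolation. Here is a standalone sketch of it (the `Material` type and function names are illustrative stand-ins for the guide's scraped-material shape, not the repo's exact code):

```typescript
// Illustrative stand-in for the scraped material shape used by the guide
type Material = { url: string; markdown: string };

// Enumerate materials as "[n] URL" blocks, the format the model is told to cite
function buildContext(materials: Material[]): string {
  const lines: string[] = [
    "Context materials (each item shows [index] and URL, followed by markdown content)",
  ];
  materials.forEach((m, i) => {
    lines.push(`\n[${i + 1}] ${m.url}\n---\n${m.markdown}\n`);
  });
  return lines.join("\n");
}

// The matching sources array maps each [n] back to its URL
function buildSources(
  materials: Material[],
): Array<{ index: number; url: string }> {
  return materials.map((m, i) => ({ index: i + 1, url: m.url }));
}

const demo: Material[] = [
  { url: "https://example.com/wasm", markdown: "# WebAssembly notes" },
];
console.log(buildContext(demo).includes("[1] https://example.com/wasm")); // true
```

Because the context and the sources array are built from the same ordered list, an inline `[n]` in the answer always resolves to the same URL the model read.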

```typescript Typescript -wcn
export async function synthesizeWithCitations(
  input: SynthesisInput,
): Promise<{
  answer: string;
  sources: Array<{ index: number; url: string }>;
}> {
  const spinner = ora("Synthesizing answer...").start();
  // Build context block
  const contextHeader =
    "Context materials (each item shows [index] and URL, followed by markdown content)";
  const contextLines: string[] = [contextHeader];
  input.materials.forEach((m, i) => {
    const idx = i + 1;
    contextLines.push(`\n[${idx}] ${m.url}\n---\n${m.markdown}\n`);
  });

  const now = new Date();

  // Day of week, month, day, year
  const dateFormatter = new Intl.DateTimeFormat("en-NZ", {
    weekday: "long",
    month: "long",
    day: "2-digit",
    year: "numeric",
    timeZone: "Pacific/Auckland",
  });

  // Time with hour + timezone abbreviation
  const timeFormatter = new Intl.DateTimeFormat("en-NZ", {
    hour: "numeric",
    minute: "2-digit",
    hour12: true,
    timeZone: "Pacific/Auckland",
    timeZoneName: "short", // e.g. "NZDT"
  });

  const dateStr = dateFormatter.format(now);
  const timeStr = timeFormatter.format(now);

  // Combine date and time, dropping ":00" so "7:00 PM" reads as "7 PM"
  const final = `${dateStr}, ${timeStr.replace(/:00/, "")}`;

  // `final` is interpolated into the full system prompt, which is elided here
  const system = ` You are ...`;

  const user = [`User query: ${input.query}`, "", contextLines.join("\n")].join(
    "\n",
  );
  let answer = "";
  let started = false;

  const completion = await openai.chat.completions.create({
    model: config.openai.model,
    messages: [
      { role: "system", content: system },
      { role: "user", content: user },
    ],
    stream: true,
  });

  for await (const chunk of completion) {
    const content = chunk.choices[0]?.delta?.content;
    if (content) {
      if (!started) {
        started = true;
        spinner.succeed("Answer synthesized");
        process.stdout.write("\n");
      }
      answer += content;
      process.stdout.write(content);
    }
  }

  // Collect sources in index order for convenience
  const sources = input.materials.map((m, i) => ({ index: i + 1, url: 
m.url }));

  console.log("\n\nSources:");
  sources.forEach((source) => {
    console.log(`[${source.index}] ${source.url}`);
  });

  return {
    answer,
    sources,
  };
}
```

Run and interpret the output
----------------------------

After `npm start`, the script logs each step as it completes:

```
✔ Search complete
✔ Scraping complete
✔ Answer synthesized

## Prediction Markets
Prediction markets offer a practical way to
hedge specific risks and to add liquidity to broader
market positions by turning uncertain outcomes into tradable,
cash-settled contracts. Their price signals aggregate diverse
information in real time, creating hedging tools and a more
liquid trading environment than many traditional markets. [1] ...
```

Tuning and tips
---------------

- Expand coverage
  - Increase `SEARCH_TOP_K` to retrieve and scrape more URLs.
  - `CONCURRENCY` controls how many pages you scrape at once.

- Respect rate limits
  - The Steel Hobby plan allows roughly 20 requests/min, so keep `SEARCH_TOP_K` and `CONCURRENCY` within that budget.

- Timeouts
  - `REQUEST_TIMEOUT_MS` applies to both Brave and Steel requests.

- Models
  - Use `OPENAI_MODEL` to choose a cost-effective model for synthesis.

- Debugging
  - Log the ranked URL list before scraping if you need to inspect relevance. 
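One helper the snippets above rely on but never define is `fetchWithTimeout`. A minimal sketch of such a wrapper using `AbortController` is shown below; the name and the 5000 ms default mirror `REQUEST_TIMEOUT_MS`, but this is an illustration, not the repo's exact implementation:

```typescript
// Minimal timeout wrapper around fetch (Node 18+): aborts the request
// once timeoutMs elapses so a slow host cannot stall the pipeline.
async function fetchWithTimeout(
  url: string,
  init: RequestInit = {},
  timeoutMs = 5000, // mirrors REQUEST_TIMEOUT_MS
): Promise<Response> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await fetch(url, { ...init, signal: controller.signal });
  } finally {
    // Clear the timer whether the fetch resolved, failed, or aborted
    clearTimeout(timer);
  }
}

// Usage: a request that rejects after 3 seconds instead of hanging
// fetchWithTimeout("https://api.search.brave.com/res/v1/web/search", {}, 3000)
//   .catch((err) => console.error("timed out or failed:", err));
```

An aborted request rejects with an `AbortError`, so callers such as the Brave search function can catch it and fall back to an empty result set.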
+ +Example project +--------------- + +- GitHub: https://github.com/steel-dev/steel-cookbook/tree/main/examples/steel-perplexity-clone + +What to customize next +---------------------- + +- Swap Brave for another Search API if you prefer +- Add caching for search and scrapes +- Persist answers and materials to a database +- Filter sources by domain whitelist/blacklist + +Support +------- + +- Steel Documentation: https://docs.steel.dev +- API Reference: https://docs.steel.dev/api-reference +- Discord Community: https://discord.gg/steel-dev diff --git a/content/docs/overview/guides/playwright-node.mdx b/content/docs/overview/guides/playwright-node.mdx index e764acf9..9e8252c8 100644 --- a/content/docs/overview/guides/playwright-node.mdx +++ b/content/docs/overview/guides/playwright-node.mdx @@ -11,7 +11,7 @@ Steel sessions are designed to be easily driven by Playwright. There are two mai -**Quick Start:** Want to jump right in? [Skip to example project](https://docs.steel.dev/overview/guides/connect-with-playwright-node#example-project-scraping-hacker-news). +**Quick Start:** Want to jump right in? [Skip to example project](https://docs.steel.dev/overview/guides/playwright-node#example-project-scraping-hacker-news). Method #1: One-line change (_easiest)_ --------------------------------------