Web Search: Playwright, spatial parsing, markdown (huggingface#1094)
* feat: playwright, spatial parsing, markdown for web search

Co-authored-by: Aaditya Sahay <[email protected]>

* feat: choose multiple clusters if necessary (#2)

* chore: resolve linting failures

* feat: improve parsing performance and error messages

* feat: combine embeddable chunks together on cpu

* feat: reduce parsed pages from 10 to 8

* feat: disable javascript in playwright by default

* feat: embedding and parsing error messages

* feat: move isURL, fix type errors, misc

* feat: misc cleanup

* feat: change serializedHtmlElement to interface

* fix: isUrl filename

* fix: add playwright dependencies to docker

* feat: add playwright browsers to docker image

* feat: enable javascript by default

* feat: remove error message from console on failed page

---------

Co-authored-by: Aaditya Sahay <[email protected]>
3 people authored May 13, 2024
1 parent 18fba9f commit 9ec5d84
Showing 39 changed files with 1,871 additions and 481 deletions.
3 changes: 2 additions & 1 deletion .env
@@ -27,6 +27,7 @@ SEARXNG_QUERY_URL=# where '<query>' will be replaced with query keywords see htt

WEBSEARCH_ALLOWLIST=`[]` # if it's defined, allow websites from only this list.
WEBSEARCH_BLOCKLIST=`[]` # if it's defined, block websites from this list.
WEBSEARCH_JAVASCRIPT=true # CPU usage reduces by 60% on average by disabling javascript. Enable to improve website compatibility

# Parameters to enable open id login
OPENID_CONFIG=`{
@@ -155,4 +156,4 @@ ALLOWED_USER_EMAILS=`[]` # if it's defined, only these emails will be allowed to
USAGE_LIMITS=`{}`
ALLOW_INSECURE_COOKIES=false # recommended to keep this to false but set to true if you need to run over http without tls
METRICS_PORT=
LOG_LEVEL=info
LOG_LEVEL=info
6 changes: 6 additions & 0 deletions Dockerfile
@@ -83,6 +83,12 @@ COPY --chown=1000 gcp-*.json /app/
COPY --from=builder --chown=1000 /app/build /app/build
COPY --from=builder --chown=1000 /app/node_modules /app/node_modules

RUN npx playwright install

USER root
RUN npx playwright install-deps
USER user

RUN chmod +x /app/entrypoint.sh

CMD ["/bin/bash", "-c", "/app/entrypoint.sh"]
2 changes: 2 additions & 0 deletions README.md
@@ -170,6 +170,8 @@ You can enable the web search through an API by adding `YDC_API_KEY` ([docs.you.

You can also simply enable the local google websearch by setting `USE_LOCAL_WEBSEARCH=true` in your `.env.local` or specify a SearXNG instance by adding the query URL to `SEARXNG_QUERY_URL`.

You can enable Javascript when parsing webpages to improve compatibility with `WEBSEARCH_JAVASCRIPT=true` at the cost of increased CPU usage. You'll want at least 4 cores when enabling.
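
As a rough illustration of how a setting like `WEBSEARCH_JAVASCRIPT` might be consumed, the sketch below maps it onto the `javaScriptEnabled` option that Playwright accepts when creating a browser context. The helper name `webSearchContextOptions` and its fallback behaviour are assumptions made for this example; the code is not taken from the commit and may differ from what the project actually does.

```typescript
// Hypothetical helper, not taken from the commit: maps the WEBSEARCH_JAVASCRIPT
// setting onto the `javaScriptEnabled` option of a Playwright browser context.
interface WebSearchContextOptions {
	javaScriptEnabled: boolean;
}

function webSearchContextOptions(
	env: Record<string, string | undefined>
): WebSearchContextOptions {
	// Assumption: an unset or empty variable is treated as enabled, mirroring
	// the `true` value shipped in .env; only an explicit "false" disables it.
	const raw = (env.WEBSEARCH_JAVASCRIPT ?? "").trim().toLowerCase();
	return { javaScriptEnabled: raw !== "false" };
}
```

With Playwright installed, the result could be passed along as `browser.newContext(webSearchContextOptions(process.env))`. Taking the environment as a parameter rather than reading `process.env` directly keeps the helper easy to test.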

### Custom models

You can customize the parameters passed to the model or even use a new model by updating the `MODELS` variable in your `.env.local`. The default one can be found in `.env` and looks like this :