feat: Add AWS backend
clemlesne committed Jan 15, 2025
1 parent 989a97d commit ffb1ef5
Showing 21 changed files with 1,383 additions and 448 deletions.
8 changes: 4 additions & 4 deletions .env.example
@@ -1,17 +1,17 @@
# OpenAI
# Azure OpenAI
# AZURE_OPENAI_API_KEY=xxx # Required if not using AAD Managed Identity
AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME=text-embedding-3-large-1
AZURE_OPENAI_EMBEDDING_DIMENSIONS=3072
AZURE_OPENAI_EMBEDDING_MODEL_NAME=text-embedding-3-large
AZURE_OPENAI_ENDPOINT=https://xxx.openai.azure.com

# AI Search
# Azure AI Search
# AZURE_SEARCH_API_KEY=xxx # Required if not using AAD Managed Identity
AZURE_SEARCH_ENDPOINT=https://xxx.search.windows.net

# Blob Storage
# Azure Blob Storage
AZURE_STORAGE_ACCESS_KEY=xxx # Required if not using AAD Managed Identity
AZURE_STORAGE_ACCOUNT_NAME=xxx

# Application Insights
# Azure Application Insights
APPLICATIONINSIGHTS_CONNECTION_STRING=xxx
16 changes: 13 additions & 3 deletions Makefile
@@ -60,10 +60,10 @@ test-static:
uv run deptry src

@echo "➡️ Test code smells (Ruff)..."
uv run ruff check --select I,PL,RUF,UP,ASYNC,A,DTZ,T20,ARG,PERF
uv run ruff check --select I,PL,RUF,UP,ASYNC,A,DTZ,T20,ARG,PERF --ignore A005

@echo "➡️ Test types (Pyright)..."
uv run pyright .
uv run pyright

test-unit:
bash cicd/test-unit-ci.sh
@@ -72,6 +72,16 @@ test-static-server:
@echo "➡️ Starting local static server..."
python3 -m http.server -d ./tests/websites 8000

test-aws-mock:
@echo "➡️ Starting AWS mock stack..."
docker run \
--interactive \
--rm \
-p 127.0.0.1:4510-4559:4510-4559 \
-p 127.0.0.1:4566:4566 \
-v /var/run/docker.sock:/var/run/docker.sock \
localstack/localstack

test-unit-run:
@echo "➡️ Unit tests (Pytest)..."
uv run pytest \
@@ -93,7 +103,7 @@ lint:
uv run ruff format

@echo "➡️ Lint with linter..."
uv run ruff check --select I,PL,RUF,UP,ASYNC,A,DTZ,T20,ARG,PERF --fix
uv run ruff check --select I,PL,RUF,UP,ASYNC,A,DTZ,T20,ARG,PERF --ignore A005 --fix

sbom:
@echo "🔍 Generating SBOM..."
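
The `test-aws-mock` target starts LocalStack in the foreground, and the CI script in this commit backgrounds it without an explicit readiness check (the tests may handle that elsewhere). A minimal sketch of a generic wait loop, assuming a Bash-compatible shell; `wait_until` is a hypothetical helper that is not part of this commit:

```bash
# Retry a command until it succeeds or the attempt budget is exhausted.
# Usage: wait_until <max_attempts> <delay_seconds> <command> [args...]
wait_until() {
    local max_attempts="$1"
    local delay="$2"
    shift 2
    local attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # Discard the probe's output; only its exit status matters here.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 1
}
```

For example, `wait_until 30 1 curl --silent --fail http://127.0.0.1:4566/_localstack/health` would poll the published port 4566 about once per second for up to 30 attempts. The health path is an assumption about the LocalStack image and should be checked against its documentation.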
81 changes: 70 additions & 11 deletions README.md
@@ -11,9 +11,9 @@ Web scraper made for AI and simplicity in mind. It runs as a CLI that can be par

Shared:

- 🏗️ Decoupled architecture with [Azure Queue Storage](https://learn.microsoft.com/en-us/azure/storage/queues) or local [sqlite](https://sqlite.org)
- 🏗️ Decoupled architecture with [Azure Queue Storage](https://learn.microsoft.com/en-us/azure/storage/queues), [AWS SQS](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/welcome.htm) or local [sqlite](https://sqlite.org)
- ⚙️ Idempotent operations that can be run in parallel
- 💾 Scraped content is stored in [Azure Blob Storage](https://learn.microsoft.com/en-us/azure/storage/blobs) or local disk
- 💾 Scraped content is stored in [Azure Blob Storage](https://learn.microsoft.com/en-us/azure/storage/blobs), [AWS S3](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html) or local disk

Scraper:

@@ -80,10 +80,26 @@ export AZURE_STORAGE_ACCOUNT_NAME=xxx
scrape-it-now scrape run https://nytimes.com
```

Usage with AWS S3 and AWS SQS:

```bash
# AWS dependencies
export BLOB_PROVIDER=aws_s3
export QUEUE_PROVIDER=aws_sqs
# AWS configuration
export AWS_ACCESS_KEY_ID=xxx
export AWS_S3_ENDPOINT=my-bucket.s3.amazonaws.com
export AWS_SECRET_ACCESS_KEY=xxx
export AWS_SQS_ENDPOINT=sqs.eu-west-1.amazonaws.com
export AWS_SQS_REGION=eu-west-1
# Run the job
scrape-it-now scrape run https://nytimes.com
```
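
Before launching a job, it can help to fail fast when one of these variables is not exported. A minimal sketch, assuming a shell with `printenv` available; `require_env` is a hypothetical helper, not part of scrape-it-now, and whether every variable is strictly required depends on the chosen providers:

```bash
# Report every variable that is missing from the exported environment,
# then return non-zero if at least one was missing.
require_env() {
    local missing=0
    local name
    for name in "$@"; do
        # printenv only sees exported variables, which is what a child
        # process such as the CLI would inherit.
        if [ -z "$(printenv "$name")" ]; then
            echo "Missing environment variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

A possible use is `require_env BLOB_PROVIDER QUEUE_PROVIDER AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_S3_ENDPOINT AWS_SQS_ENDPOINT AWS_SQS_REGION && scrape-it-now scrape run https://nytimes.com`, so the job only starts when all listed variables are set.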

Usage with Local Disk Blob and Local Disk Queue:

```bash
# Local disk configuration
# Local disk dependencies
export BLOB_PROVIDER=local_disk
export QUEUE_PROVIDER=local_disk
# Run the job
@@ -104,12 +120,10 @@ Example:
...
```
Most frequent options are:
Frequent general options are:
| `Options` | Description | `Environment variable` |
|-|-|-|
| `--azure-storage-access-key`</br>`-asak` | Azure Storage access key | `AZURE_STORAGE_ACCESS_KEY` |
| `--azure-storage-account-name`</br>`-asan` | Azure Storage account name | `AZURE_STORAGE_ACCOUNT_NAME` |
| `--blob-provider`</br>`-bp` | Blob provider | `BLOB_PROVIDER` |
| `--job-name`</br>`-jn` | Job name | `JOB_NAME` |
| `--max-depth`</br>`-md` | Maximum depth | `MAX_DEPTH` |
@@ -118,6 +132,23 @@ Most frequent options are:
| `--save-screenshot`</br>`-ss` | Save screenshot | `SAVE_SCREENSHOT` |
| `--whitelist`</br>`-w` | Whitelist | `WHITELIST` |
Frequent Azure options are:
| `Options` | Description | `Environment variable` |
|-|-|-|
| `--azure-storage-access-key`</br>`-asak` | Azure Storage access key | `AZURE_STORAGE_ACCESS_KEY` |
| `--azure-storage-account-name`</br>`-asan` | Azure Storage account name | `AZURE_STORAGE_ACCOUNT_NAME` |
Frequent AWS options are:
| `Options` | Description | `Environment variable` |
|-|-|-|
| `--aws-access-key-id`</br>`-aaki` | AWS access key ID | `AWS_ACCESS_KEY_ID` |
| `--aws-s3-endpoint`</br>`-ase` | AWS S3 endpoint | `AWS_S3_ENDPOINT` |
| `--aws-secret-access-key`</br>`-asak` | AWS secret access key | `AWS_SECRET_ACCESS_KEY` |
| `--aws-sqs-endpoint`</br>`-ase` | AWS SQS endpoint | `AWS_SQS_ENDPOINT` |
| `--aws-sqs-region`</br>`-asr` | AWS SQS region | `AWS_SQS_REGION` |
For documentation on all available options, run:
```bash
@@ -151,13 +182,26 @@ Example:
{"created_at":"2024-11-08T13:18:52.839060Z","last_updated":"2024-11-08T13:19:16.528370Z","network_used_mb":2.6666793823242188,"processed":1,"queued":311}
```
Most frequent options are:
Frequent general options are:
| `Options` | Description | `Environment variable` |
|-|-|-|
| `--blob-provider`</br>`-bp` | Blob provider | `BLOB_PROVIDER` |
Frequent Azure options are:
| `Options` | Description | `Environment variable` |
|-|-|-|
| `--azure-storage-access-key`</br>`-asak` | Azure Storage access key | `AZURE_STORAGE_ACCESS_KEY` |
| `--azure-storage-account-name`</br>`-asan` | Azure Storage account name | `AZURE_STORAGE_ACCOUNT_NAME` |
| `--blob-provider`</br>`-bp` | Blob provider | `BLOB_PROVIDER` |
Frequent AWS options are:
| `Options` | Description | `Environment variable` |
|-|-|-|
| `--aws-access-key-id`</br>`-aaki` | AWS access key ID | `AWS_ACCESS_KEY_ID` |
| `--aws-s3-endpoint`</br>`-ase` | AWS S3 endpoint | `AWS_S3_ENDPOINT` |
| `--aws-secret-access-key`</br>`-asak` | AWS secret access key | `AWS_SECRET_ACCESS_KEY` |
For documentation on all available options, run:
@@ -221,7 +265,14 @@ Example:
...
```
Most frequent options are:
Frequent general options are:
| `Options` | Description | `Environment variable` |
|-|-|-|
| `--blob-provider`</br>`-bp` | Blob provider | `BLOB_PROVIDER` |
| `--queue-provider`</br>`-qp` | Queue provider | `QUEUE_PROVIDER` |
Frequent Azure options are:
| `Options` | Description | `Environment variable` |
|-|-|-|
@@ -234,8 +285,16 @@ Most frequent options are:
| `--azure-search-endpoint`</br>`-ase` | Azure Search endpoint | `AZURE_SEARCH_ENDPOINT` |
| `--azure-storage-access-key`</br>`-asak` | Azure Storage access key | `AZURE_STORAGE_ACCESS_KEY` |
| `--azure-storage-account-name`</br>`-asan` | Azure Storage account name | `AZURE_STORAGE_ACCOUNT_NAME` |
| `--blob-provider`</br>`-bp` | Blob provider | `BLOB_PROVIDER` |
| `--queue-provider`</br>`-qp` | Queue provider | `QUEUE_PROVIDER` |
Frequent AWS options are:
| `Options` | Description | `Environment variable` |
|-|-|-|
| `--aws-access-key-id`</br>`-aaki` | AWS access key ID | `AWS_ACCESS_KEY_ID` |
| `--aws-s3-endpoint`</br>`-ase` | AWS S3 endpoint | `AWS_S3_ENDPOINT` |
| `--aws-secret-access-key`</br>`-asak` | AWS secret access key | `AWS_SECRET_ACCESS_KEY` |
| `--aws-sqs-endpoint`</br>`-ase` | AWS SQS endpoint | `AWS_SQS_ENDPOINT` |
| `--aws-sqs-region`</br>`-asr` | AWS SQS region | `AWS_SQS_REGION` |
For documentation on all available options, run:
39 changes: 29 additions & 10 deletions cicd/test-unit-ci.sh
Original file line number Diff line number Diff line change
@@ -1,17 +1,36 @@
#!/bin/bash

# Start the first command in the background
make test-static-server 1>/dev/null 2>&1 &
write_header() {
    lightcyan='\033[1;36m'
    nocolor='\033[0m'
    echo -e "${lightcyan}➡️ $1${nocolor}"
}

# Capture the PID of the background process
UNIT_RUN_PID=$!
cleanup() {
    write_header "Cleaning up"
    kill $aws_mock_pid
    kill $static_server_pid
}

# Run the second command
make test-unit-run
exit_code=$?
# Unregister on success
trap 'cleanup; exit 0' EXIT
# Unregister on Ctrl+C
trap 'cleanup; exit 130' INT
# Unregister on SIGTERM
trap 'cleanup; exit 143' TERM

# Start AWS mock in background
write_header "Starting AWS mock"
make test-aws-mock 2>&1 &
aws_mock_pid=$!

# Once the second command exits, kill the first process
kill $UNIT_RUN_PID
# Start static server in background
write_header "Starting static server"
make test-static-server 2>&1 &
static_server_pid=$!

# Exit with the same exit code as the second command
# Run the unit tests
make test-unit-run
exit_code=$?
write_header "Unit tests finished"
exit $exit_code
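
One thing worth checking in this script: an `exit` inside an `EXIT` trap replaces the status the shell was about to return, so `trap 'cleanup; exit 0' EXIT` would likely make the script report success even when `make test-unit-run` fails. A minimal sketch of a trap that preserves the status, assuming a Bash-compatible shell; `run_with_helpers` is a hypothetical wrapper and `sleep 60` is only a stand-in for the background `make` targets:

```bash
# Run a command while a background helper is alive, stop the helper
# afterwards, and return the command's own exit status.
run_with_helpers() {
    (
        helper_pids=""

        cleanup() {
            # Capture the pending exit status before any other command runs.
            status=$?
            # Disarm the trap so cleanup cannot run a second time.
            trap - EXIT
            for pid in $helper_pids; do
                kill "$pid" 2>/dev/null
            done
            exit "$status"
        }

        trap cleanup EXIT
        # Signals only set the status; the EXIT trap does the cleanup.
        trap 'exit 130' INT
        trap 'exit 143' TERM

        # Stand-in for "make test-aws-mock &" or "make test-static-server &".
        sleep 60 &
        helper_pids="$helper_pids $!"

        "$@"
    )
}
```

With this shape, `run_with_helpers make test-unit-run` would return the test run's status, and Ctrl+C or SIGTERM would still stop the helper exactly once.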
6 changes: 6 additions & 0 deletions pyproject.toml
@@ -32,6 +32,7 @@ name = "scrape-it-now"
readme = "README.md"
requires-python = ">=3.11"
dependencies = [
"aiobotocore~=2.17",
"aiodns~=3.2",
"aiofiles~=24.1",
"aiohttp~=3.10",
@@ -42,6 +43,7 @@ dependencies = [
"azure-search-documents~=11.6a0",
"azure-storage-blob~=12.22",
"azure-storage-queue~=12.11",
"botocore~=1.35",
"click~=8.1",
"openai~=1.42",
"opentelemetry-instrumentation-aiohttp-client~=0.0a0",
@@ -122,3 +124,7 @@ docstring-code-format = true

[tool.pyright]
pythonVersion = "3.11"
include = [
"src",
"tests",
]