The goal of this project was to develop an API for an eCommerce clothing store to prevent duplicate products from being added to their database. The system checks for similar images in the company's database and returns the top matches, helping to identify duplicate listings. The API takes an image URL as input and returns the most similar images from the database along with the match percentage.
When hosted on Elastic Beanstalk, the API accepts queries for similar images in the following format:
http://<your-environment-url>/find_similar/?image_url=<image_url>&top=<number_of_results>
- <your-environment-url>: The URL of your AWS Elastic Beanstalk environment.
- <image_url>: The URL of the image you want to check for duplicates.
- <number_of_results>: The number of top similar images to return (default is 1).
For example:
http://imagesearch3-dev.ap-south-1.elasticbeanstalk.com/find_similar/?image_url=https://d1it09c4puycyh.cloudfront.net/707x1000/catalog/product/6/6/6619-RED_1.jpg&top=3
The output of the API call would resemble this:
To begin, you will need to create a SageMaker domain and then set up a notebook instance to execute the code provided. Follow the instructions below to complete this setup:
- Go to AWS SageMaker and create a SageMaker domain.
- Create a notebook instance in the domain where you will execute the provided scripts.
- Refer to the official AWS SageMaker documentation for detailed steps: AWS SageMaker Getting Started Guide.
- Once the domain and notebook are set up, create a DynamoDB table and then proceed to use the similar_image_checker_runner_scripts/processing_template_phash.ipynb file.
You need to create a DynamoDB table to store the perceptual hash (pHash) of each image. In all the scripts, I have used the hash_store table. Here’s the structure of the table:
- Table Name: hash_store
- Primary Key (Partition Key): entity_id (string) – a unique identifier for each image.
- Attributes:
  - sku: Stock Keeping Unit (SKU) for identifying the product.
  - small_image: URL to the image.
  - phash: Perceptual hash value of the image, stored as a hexadecimal string.
The table structure depends on your dataset; the structure above is for the dataset provided in this repository. In the next steps, replace the table name in the repository scripts with your own table name.
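The table described above could also be created programmatically. The following is a minimal sketch, not the repository's own code: the function names are hypothetical, on-demand billing is an assumption, and the actual call requires boto3 and configured AWS credentials.

```python
def hash_store_table_spec(table_name="hash_store"):
    """Keyword arguments for DynamoDB's CreateTable call.

    Only the key attribute is declared: DynamoDB is schemaless for non-key
    attributes, so sku, small_image and phash are simply written with each item.
    """
    return {
        "TableName": table_name,
        "KeySchema": [{"AttributeName": "entity_id", "KeyType": "HASH"}],
        "AttributeDefinitions": [
            {"AttributeName": "entity_id", "AttributeType": "S"}
        ],
        # Assumption: on-demand billing, so no capacity units need to be chosen.
        "BillingMode": "PAY_PER_REQUEST",
    }


def create_hash_store_table(region_name, table_name="hash_store"):
    """Create the table; needs boto3 installed and AWS credentials configured."""
    import boto3  # imported lazily so the spec above can be inspected without it

    client = boto3.client("dynamodb", region_name=region_name)
    return client.create_table(**hash_store_table_spec(table_name))
```

Calling `create_hash_store_table("ap-south-1")` would issue the request; the region shown is only an example.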
Once the table is created, you can use the SageMaker notebook environment to run similar_image_checker_runner_scripts/processing_template_phash.ipynb. This script processes images in batches and populates the DynamoDB table by generating a pHash for each image.
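For readers unfamiliar with perceptual hashing, here is a rough pure-Python sketch of a common DCT-based pHash. It is an illustration, not the notebook's code: the function names are hypothetical, the input is assumed to be an image already decoded, converted to grayscale and resized to 32x32 (a job for an image library), and the resulting hex strings are not guaranteed to match those the notebook produces.

```python
import math
from statistics import median


def dct_1d(values):
    """Unnormalised DCT-II of a sequence of numbers."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i, v in enumerate(values))
        for k in range(n)
    ]


def dct_2d(matrix):
    """2D DCT: transform every row, then every column of the result."""
    rows = [dct_1d(row) for row in matrix]
    cols = [dct_1d(col) for col in zip(*rows)]  # one entry per column
    return [list(r) for r in zip(*cols)]        # transpose back to row-major


def phash_hex(gray, hash_size=8):
    """Hash a square grayscale matrix (e.g. 32x32) into a hex string.

    Keeps the hash_size x hash_size lowest-frequency DCT coefficients and
    sets one bit per coefficient: 1 if it is above their median, else 0.
    """
    coeffs = dct_2d(gray)
    low = [coeffs[r][c] for r in range(hash_size) for c in range(hash_size)]
    threshold = median(low)
    bits = "".join("1" if v > threshold else "0" for v in low)
    width = hash_size * hash_size // 4  # 4 bits per hex digit
    return f"{int(bits, 2):0{width}x}"
```

With the default `hash_size=8` the result is a 64-bit hash written as 16 hexadecimal characters, which is the kind of value the phash attribute is described as holding.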
After processing all batches, a single item in the table would resemble this:
Running a nearest neighbor search directly on DynamoDB can be inefficient and costly, as it requires scanning through each row. To optimize this, we use OpenSearch for fast and efficient searches using indexes.
- Create an OpenSearch domain using the following tutorial, which covers the steps under the free tier: OpenSearch Domain Creation Tutorial.
- In the next steps, replace the index name in the repository scripts with your own index name.
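To make the cost of a full scan concrete, the sketch below performs the nearest-neighbor search by brute force over an in-memory dictionary of hashes. The function names are hypothetical, and defining the match percentage as the share of identical bits is an assumption; the API's exact formula is not stated here.

```python
def hamming_distance(hash_a, hash_b):
    """Number of differing bits between two hex-encoded hashes."""
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def match_percentage(hash_a, hash_b, n_bits=64):
    """Share of identical bits, as a percentage (assumed definition)."""
    return 100.0 * (n_bits - hamming_distance(hash_a, hash_b)) / n_bits


def brute_force_top(query_hash, stored_hashes, top=1):
    """Compare the query against every stored hash and keep the best matches.

    stored_hashes maps entity_id -> hex pHash. Every entry is visited on
    every query, which is the per-row work an index avoids.
    """
    scored = [
        (entity_id, match_percentage(query_hash, phash))
        for entity_id, phash in stored_hashes.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top]
```

Each query touches every stored hash, so the work grows linearly with the size of the catalog.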
After setting up OpenSearch, use the similar_image_checker_runner_scripts/dynamo_to_opensearch.ipynb notebook to populate your OpenSearch index with the data stored in DynamoDB. This notebook extracts the items from DynamoDB, converts the pHash values into binary vectors, and pushes them into OpenSearch.
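The conversion from a hexadecimal pHash to a binary vector can be sketched as follows. The function name is hypothetical, and a 64-bit hash with most-significant-bit-first ordering is an assumption; the notebook may order the bits differently.

```python
def phash_hex_to_bits(phash_hex, n_bits=64):
    """Expand a hex-encoded hash into a list of 0/1 integers.

    The length is fixed at n_bits so that leading zero bits are preserved.
    """
    value = int(phash_hex, 16)
    return [(value >> shift) & 1 for shift in range(n_bits - 1, -1, -1)]
```

For 0/1 vectors, the squared Euclidean distance between two vectors equals the number of positions where they differ, so a distance-based vector search over these vectors ranks results by Hamming distance.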
The script includes functions to:
- Create the OpenSearch index with K-Nearest Neighbors (KNN) enabled for efficient similarity search.
- Scan DynamoDB in batches and convert the pHash values into a vector representation.
- Send bulk requests to populate OpenSearch with the converted data.
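The three functions listed above might look roughly like the sketch below. This is not the notebook's code: all names, including the `phash_vector` field, are assumptions, and the HTTP calls to OpenSearch are left out so that only the request payloads are shown.

```python
import json


def knn_index_body(dimension=64, vector_field="phash_vector"):
    """Index settings and mappings with k-NN enabled for the vector field."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                vector_field: {"type": "knn_vector", "dimension": dimension},
                "sku": {"type": "keyword"},
                "small_image": {"type": "keyword"},
            }
        },
    }


def scan_all_items(scan_page, **scan_kwargs):
    """Yield every item of a DynamoDB scan, following pagination.

    scan_page is any callable with the interface of a boto3 Table.scan:
    it returns a dict with "Items" and, when more pages remain,
    "LastEvaluatedKey".
    """
    kwargs = dict(scan_kwargs)
    while True:
        page = scan_page(**kwargs)
        yield from page.get("Items", [])
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return
        kwargs["ExclusiveStartKey"] = last_key


def bulk_body(index_name, docs):
    """Build the newline-delimited JSON payload for the _bulk endpoint.

    docs is an iterable of (document_id, document) pairs. Each document is
    preceded by an action line, and every line ends with a newline.
    """
    lines = []
    for doc_id, doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(doc))
    return "".join(line + "\n" for line in lines)
```

Passing `table.scan` from a boto3 Table resource as `scan_page` would walk the whole table one page at a time, and the string returned by `bulk_body` is what would be posted to the `_bulk` endpoint.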
The view_opensearch_data() function shows 10 items that were added to your OpenSearch index; use it to verify that the data was transferred.
Once you finish running this script, your OpenSearch index will be populated with the pHash vectors from DynamoDB.
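With the index populated, a similarity lookup becomes a k-NN query against the vector field. The sketch below builds such a request body; the function name and the `phash_vector` field name are assumptions and should be adapted to your own index mapping.

```python
def knn_query_body(query_vector, top=1, vector_field="phash_vector"):
    """Request body for an OpenSearch k-NN search returning `top` neighbors."""
    return {
        "size": top,
        "query": {
            "knn": {
                vector_field: {"vector": list(query_vector), "k": top}
            }
        },
    }
```

The body would be sent to the `_search` endpoint of your index, using the same endpoint URL and credentials as the population step.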
Now that you have set up your OpenSearch endpoint, you can deploy the API as a FastAPI application on AWS Elastic Beanstalk. The instructions below cover setting up the project locally, running the application, and deploying it to AWS Elastic Beanstalk.
Before you begin, ensure you have the following installed:
- Python 3.7 - 3.11
- AWS CLI
- AWS Elastic Beanstalk CLI (EB CLI)
- Git
- Clone the Repository
  git clone https://github.com/Entro01/imageSearch.git
  cd imageSearch
- Create and Activate a Virtual Environment
  python -m venv env
  source env/bin/activate
- Install Dependencies
  Make sure requirements.txt is updated with all necessary dependencies. Install them using:
  pip install -r requirements.txt
- Configuration
  Before deploying, you need to modify the application_phash.py file to configure the OpenSearch settings:
  - Open application_phash.py.
  - Locate the OpenSearch configuration section and update opensearch_url and auth with your OpenSearch endpoint and authentication details:
    opensearch_url = "https://your-opensearch-endpoint"
    auth = HTTPBasicAuth('your-username', 'your-password')
  For example:
    # OpenSearch configuration
    opensearch_url = "https://search-imagehash-beqqt46rp2xv6agh7tohq5it7i.aos.us-east-1.on.aws"
    auth = HTTPBasicAuth('admin', '1234')
- Run the Application Locally
  fastapi dev application_phash.py
  Visit http://127.0.0.1:8000 in your browser to access the application.
- Configure AWS CLI
  Ensure your AWS CLI is configured with your AWS credentials:
  aws configure
  Follow the prompts to enter your AWS Access Key ID, Secret Access Key, region, and output format.
- Initialize Elastic Beanstalk
  Run the following command in your project directory:
  eb init
  Follow the prompts to set up your Elastic Beanstalk application. Select the appropriate region, application name, and platform (Python).
- Create an Environment and Deploy
  To create an environment and deploy your application, run:
  eb create your-environment-name
  Replace your-environment-name with a name for your environment. To deploy changes to an existing environment, use:
  eb deploy
- Open the Application
  Once the deployment is complete, you can open your application in a web browser:
  eb open
- Use eb logs to diagnose issues in case the deployment fails.
You can send queries to your deployed application to find similar images. Use the following URL format:
http://<your-environment-url>/find_similar/?image_url=<image_url>&top=<number_of_results>
<your-environment-url>: Replace with the URL of your Elastic Beanstalk environment.
<image_url>: The URL of the image you want to find similar images for.
<number_of_results>: The number of KNN neighbors (results) to retrieve (default value: 1).
For example:
http://imagesearch3-dev.ap-south-1.elasticbeanstalk.com/find_similar/?image_url=https://d1it09c4puycyh.cloudfront.net/707x1000/catalog/product/6/6/6619-RED_1.jpg&top=3
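When the query is built from code, it is safer to URL-encode the parameters so that any `&` or `?` inside the image URL is not read as part of the API's own query string. A minimal sketch using only the standard library follows; the function name is hypothetical.

```python
from urllib.parse import urlencode


def build_find_similar_url(environment_url, image_url, top=1):
    """Return the /find_similar/ query URL with percent-encoded parameters."""
    query = urlencode({"image_url": image_url, "top": top})
    return f"{environment_url.rstrip('/')}/find_similar/?{query}"
```

The returned string can be opened in a browser or passed to any HTTP client.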