.
├── data
│ ├── processed
│ │ └── v1
│ ├── raw
│ │ └── v1
│ ├── scripts
│ │ ├── extract.py
│ │ ├── transform.py
│ │ └── load.py
│ └── main.py
├── deploy
│ ├── scripts
│ └── tests
├── develop
│ ├── artifacts
│ ├── eda
│ ├── notebooks
│ ├── scripts
│ │ ├── eda.py
│ │ └── load.py
│ └── main.py
├── label
├── train
├── visualise
├── .flake8
├── .gitattributes
├── .gitignore
├── .pre-commit-config.yaml
├── Makefile
├── poetry.toml
├── pyproject.toml
└── README.md
- Initially, I attempted to gather data directly from the application by identifying the endpoints that the application communicates with on the backend. However, this approach proved to be labor-intensive, so I decided to skip it.
- Instead, I obtained data relevant to the task from Kaggle. You can access the dataset here. It comprises approximately 140 CSV files, each containing useful attributes such as `main_category` and `image` of the product, which are instrumental for our task.
- Finally, the data is ingested into the `/data/raw` directory on the instance created on `GCP`.
- Extract only the desired attributes (`main_category`, `sub_category`, and `image_url`) from all 140 CSV files and save the output of this phase into `/data/processed/v1`.

      def process_all_csv(
          source: str, file_names: List[str], cols: str, destination_base: str
      ):
          ...
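The body of `process_all_csv` is not shown here, so the following is only a minimal sketch of the column-extraction idea using the standard library `csv` module. The helper names `extract_columns` and `extract_all` are hypothetical, and the real implementation may well use a different library.

```python
import csv
from pathlib import Path
from typing import List


def extract_columns(source: Path, cols: List[str], destination: Path) -> int:
    """Copy only `cols` from one CSV file into `destination`; return rows written."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    with source.open(newline="", encoding="utf-8") as src, \
            destination.open("w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        # extrasaction="ignore" drops every column that is not listed in `cols`
        writer = csv.DictWriter(dst, fieldnames=cols, extrasaction="ignore")
        writer.writeheader()
        rows = 0
        for row in reader:
            writer.writerow(row)
            rows += 1
    return rows


def extract_all(source_dir: str, file_names: List[str], cols: List[str],
                destination_base: str) -> int:
    """Apply `extract_columns` to every listed file, keeping the file names."""
    total = 0
    for name in file_names:
        total += extract_columns(Path(source_dir) / name, cols,
                                 Path(destination_base) / name)
    return total
```

Streaming row by row keeps memory usage flat regardless of file size, which may matter less for 140 moderately sized files than for the later image steps.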
- Combine the outputs of the previous step into a single file with about 400k entries.
- Remove duplicated URLs after merging.
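Merging and de-duplication can be done in a single pass. The sketch below assumes the URL column is named `image_url` and keeps the first row seen for each URL; the function name `merge_unique` is hypothetical and not taken from the project.

```python
import csv
from pathlib import Path
from typing import Iterable, List


def merge_unique(csv_paths: Iterable[Path], destination: Path, cols: List[str],
                 key: str = "image_url") -> int:
    """Merge CSV files into one, keeping the first row seen for each `key` value."""
    seen = set()
    with destination.open("w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=cols, extrasaction="ignore")
        writer.writeheader()
        for path in csv_paths:
            with path.open(newline="", encoding="utf-8") as src:
                for row in csv.DictReader(src):
                    url = row.get(key)
                    if not url or url in seen:
                        continue  # skip rows with no URL or an already-seen URL
                    seen.add(url)
                    writer.writerow(row)
    # Number of unique URLs, which equals the number of rows written
    return len(seen)
```

A set of about 400k short strings fits comfortably in memory, so a dedicated de-duplication pass over the merged file is not strictly needed.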
- As an intermediate step for handling the large dataset, I implemented a function that divides any given list into chunks, so that each chunk can be processed asynchronously in the following steps.

      def chunk_list(lst, chunk_size):
          for i in range(0, len(lst), chunk_size):
              yield lst[i : i + chunk_size]
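A short usage example of `chunk_list` may help. The URLs below are placeholders; note that the last chunk is simply shorter when the list length is not a multiple of `chunk_size`.

```python
def chunk_list(lst, chunk_size):
    # Yield consecutive slices of at most `chunk_size` items
    for i in range(0, len(lst), chunk_size):
        yield lst[i : i + chunk_size]


# Placeholder URLs, for illustration only
urls = [f"https://example.com/{i}.jpg" for i in range(7)]
chunks = list(chunk_list(urls, 3))
print([len(chunk) for chunk in chunks])  # [3, 3, 1]
```

Because the function is a generator, chunks are produced lazily, so only the current slice has to be materialised while iterating.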
- Reduce the labels: there were more than 14 categories, which I reduced to 7.
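One straightforward way to reduce labels is a lookup table from fine-grained to coarse categories. The category names below are invented for illustration, since the actual grouping is not listed here, and `reduce_labels` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Hypothetical mapping: the real category names and groupings are not listed here.
LABEL_MAP: Dict[str, str] = {
    "men's clothing": "clothing",
    "women's clothing": "clothing",
    "men's shoes": "shoes",
    "women's shoes": "shoes",
    "tv, audio & cameras": "electronics",
}


def reduce_labels(labels: Iterable[str], mapping: Dict[str, str]) -> List[str]:
    """Map each fine-grained label to its coarse label; unmapped labels are kept."""
    return [mapping.get(label.strip().lower(), label) for label in labels]


reduced = reduce_labels(["Men's Shoes", "women's shoes", "Toys"], LABEL_MAP)
# Counting the reduced labels gives a quick view of the class balance
print(Counter(reduced))
```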
- Implement an async function to validate the URL of each image.

      @classmethod
      async def is_valid_url(cls, url: str, session: aiohttp.ClientSession) -> bool:
          try:
              async with session.get(url) as response:
                  response.raise_for_status()
                  return True
          # aiohttp.ClientError also covers connection failures, not only bad status codes
          except (aiohttp.ClientError, asyncio.TimeoutError):
              return False

  This step reduces the number of entries from 400k to 200k.
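With hundreds of thousands of URLs, starting every request at once can exhaust sockets or trigger rate limiting on the remote server. Chunking is one way to bound this; another is a semaphore, sketched below. The `check` argument stands in for a coroutine such as `is_valid_url` bound to a session, and `filter_valid` and `max_concurrency` are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, List


async def filter_valid(urls: List[str],
                       check: Callable[[str], Awaitable[bool]],
                       max_concurrency: int = 100) -> List[str]:
    """Return the URLs for which `check` resolves to True, preserving order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url: str) -> bool:
        async with semaphore:  # at most `max_concurrency` checks run at once
            return await check(url)

    # gather returns results in the same order as the input URLs
    results = await asyncio.gather(*(guarded(url) for url in urls))
    return [url for url, ok in zip(urls, results) if ok]
```

The checker is passed in as a parameter, so the same function can be exercised with a fake checker in tests and with the real HTTP check in production.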
- Then fetch the image of each entry and persist it on the instance that was created in the cloud:
  - Divide the large list into chunks.
  - Process each chunk asynchronously to get the images from the remote server.

        @classmethod
        async def get_image_from_links(cls, urls: List[str]) -> List[Optional[bytes]]:
            async with aiohttp.ClientSession() as session:
                tasks = [ImageCollector.fetch_image(url, session) for url in urls]
                return await asyncio.gather(*tasks)

        @classmethod
        async def fetch_image(
            cls, url: str, session: aiohttp.ClientSession
        ) -> Optional[bytes]:
            try:
                async with session.get(url) as response:
                    response.raise_for_status()
                    return await response.read()
            # Return None on any client-side failure, including connection errors
            except (aiohttp.ClientError, asyncio.TimeoutError):
                return None

  - Persist the collected images to disk.

        def persist_images(images: List[Optional[bytes]], dirs: List[str]) -> bool:
            all_saved = True
            for image, dir in zip(images, dirs):
                # A failed download yields None, so count it as not persisted
                result = image is not None and save_cvimage(parse_image_content(image), dir)
                if not result:
                    all_saved = False
                    print(f"Failed to persist {os.path.basename(dir)}")
            return all_saved

    - Parse the collected binary data into 3D NumPy arrays using `OpenCV`.
    - Save the parsed image to disk.
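The three steps above (chunk, fetch, persist) can be tied together by a small driver. This is a sketch under assumptions rather than the project's actual entry point: `fetch_chunk` and `persist` stand in for functions like `get_image_from_links` and `persist_images`, and `collect_images` is a hypothetical name.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional, Sequence

FetchChunk = Callable[[List[str]], Awaitable[List[Optional[bytes]]]]
Persist = Callable[[List[Optional[bytes]], List[str]], bool]


async def collect_images(urls: Sequence[str], paths: Sequence[str],
                         fetch_chunk: FetchChunk, persist: Persist,
                         chunk_size: int = 1000) -> int:
    """Download and persist images chunk by chunk; return the number of chunks
    that `persist` reported as fully saved."""
    saved_chunks = 0
    for start in range(0, len(urls), chunk_size):
        chunk_urls = list(urls[start:start + chunk_size])
        chunk_paths = list(paths[start:start + chunk_size])
        # Only one chunk of raw bytes is held in memory at a time
        images = await fetch_chunk(chunk_urls)
        if persist(images, chunk_paths):
            saved_chunks += 1
    return saved_chunks
```

Requests within a chunk run concurrently while chunks run one after another, so `chunk_size` bounds both memory use and the number of simultaneous connections.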
- Push the data to Kaggle.

      # Create the JSON metadata file used to configure the dataset
      kaggle datasets init -p /path/to/data
      # Create the dataset after filling in the metadata;
      # the -u flag makes it public
      kaggle datasets create --dir-mode tar -p /path/to/data -u
- Final output: here you can access the public dataset on Kaggle that resulted from the previous steps.
I decided to use `TensorFlow` rather than `PyTorch` due to time constraints and our use case, which includes:
- Transfer Learning: use pre-trained models available in TensorFlow Hub or models trained on large datasets like ImageNet. Fine-tuning these models for our specific task can significantly reduce training time and data requirements.
- TensorFlow Extended (TFX): for building end-to-end machine learning pipelines for production, TFX provides a suite of tools for building, deploying, and maintaining production-ready ML pipelines.
- TensorBoard: TensorFlow comes with TensorBoard, a visualization tool that helps monitor and debug models. It can visualize metrics, model graphs, embeddings, and more to give insight into the model's behavior.
- Model Serving and Deployment: since deployment is a significant concern, TensorFlow provides tools like TensorFlow Serving, TensorFlow Lite, and TensorFlow.js for deploying models in various environments, including cloud, mobile, and web.
Build a wrapper class to ingest the images from the saved directory into the model:
- Process the given path of any image and return the image and its label as tensor objects.

      @classmethod
      def process_path(cls, file_path: pathlib.Path):
          label = ImageProcessor.get_label(file_path)
          img = tf.io.read_file(file_path)
          img = ImageProcessor.decode_img(img)
          return img, label
- `ImageDataSet` is the wrapper class that uses the previous function to prepare the dataset and provides an API to split it into `train`, `validation`, and `test` splits.

      class ImageDataSet:
          def __init__(
              self, path: str, train_size: float, test_size: float, val_size: float
          ) -> None:
              ...

          def get_train_val_test(self, batch_size, width, height) -> Tuple[tf.data.Dataset, ...]:
              ...
Inside `get_train_val_test`, I used other functions from `ImageProcessor`, such as:

    @classmethod
    def prepare_for_training(cls, ds: tf.data.Dataset, shuffle_buffer_size=1000):
        ds = ds.shuffle(buffer_size=shuffle_buffer_size)
        ds = ds.batch(ImageProcessor.BATCH_SIZE)
        ds = ds.prefetch(buffer_size=ImageProcessor.AUTOTUNE)
        return ds

This function takes the dataset object, shuffles it, divides it into batches, and prefetches it for training efficiency.
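The body of `get_train_val_test` is not shown, so the following only sketches how the three fractions passed to `ImageDataSet` could be turned into split sizes. The function `split_counts` is hypothetical; with `tf.data`, counts like these are commonly applied through `Dataset.take` and `Dataset.skip`, though the project may do it differently.

```python
from typing import Tuple


def split_counts(n_items: int, train_size: float, val_size: float,
                 test_size: float) -> Tuple[int, int, int]:
    """Turn three fractions that sum to 1 into item counts that sum to n_items."""
    if abs(train_size + val_size + test_size - 1.0) > 1e-9:
        raise ValueError("train_size, val_size and test_size must sum to 1")
    n_train = int(n_items * train_size)
    n_val = int(n_items * val_size)
    # The test split takes the remainder so no item is dropped by rounding
    n_test = n_items - n_train - n_val
    return n_train, n_val, n_test


print(split_counts(1000, 0.8, 0.1, 0.1))  # (800, 100, 100)
```

Shuffling the file list once before splitting, and not reshuffling it between epochs, keeps the three splits disjoint.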
I leveraged the power of transfer learning and used the ResNet architecture with ImageNet weights. I also tried another architecture, `VGG19`, with the same weights.
I built 4 different architectures using the previous ones as a starting point and tried to adapt them to our data. Here are the initial insights:
- `Model_v1` is performing well according to the shown metrics.
- Performance of `Model_v2` during the training process.
- The chosen data has some labelling issues, so label correction is needed to improve the results over this baseline.
Tracking all these experiments without `TensorBoard` and `Weights & Biases` would be troublesome indeed :(