ExtendedMGIE - MLLM-Guided Image Editing with Progressive Feature Blending and Cross-Attention Masking
Welcome to the official implementation of the extended MGIE framework, which integrates Progressive Feature Blending (PFB), Cross-Attention Masking (CAM), Identity Embeddings (IE), and Gaussian Blurring (GB) for text-driven image editing. Built on the capabilities of Multimodal Large Language Models (MLLMs), the framework produces realistic and coherent image modifications while preserving the identity and spatial consistency of the edited regions. Together, these components enable detailed, semantically aligned, and controllable edits guided by visual-aware instructions.
Architecture of the xMGIE framework
The MLLM-Guided Image Editing (MGIE) framework is designed to revolutionize text-driven image editing by leveraging Multimodal Large Language Models (MLLMs) to generate detailed, visually-aware instructions. While the original MGIE framework was impressive, our enhancements with Progressive Feature Blending (PFB), Cross-Attention Masking (CAM), Identity Embeddings (IE), and Gaussian Blurring (GB) elevate its capabilities to a new level of precision and realism.
- Progressive Feature Blending (PFB): Seamlessly integrates MLLM-generated content with the original image across multiple feature levels, ensuring visual coherence and consistency (see the PFB/GB sketch after this list).
- Cross-Attention Masking (CAM): Provides precise control over the editing process by restricting the influence of specific text tokens to the desired image regions (see the CAM sketch after this list).
- Identity Embeddings (IE): Preserves the identity and key characteristics of objects and individuals in the image, maintaining their distinctive features throughout the editing process.
- Gaussian Blurring (GB): Enhances spatial coherence and natural blending of edited regions with the original image through spatially-varying Gaussian blur of the edit mask (also illustrated in the PFB/GB sketch below).
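To make the PFB and GB ideas concrete, here is a minimal, self-contained PyTorch sketch: a binary edit mask is softened with a Gaussian blur and then used to blend edited and original feature maps at several resolutions. The function names (`soften_mask`, `progressive_feature_blend`), tensor shapes, and blur parameters are illustrative assumptions, not code taken from `code/mgie_implementation.ipynb`.

```python
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF


def soften_mask(mask, kernel_size=15, sigma=4.0):
    """GB step (illustrative): blur a binary edit mask so its boundary is smooth.

    mask: (B, 1, H, W) tensor with values in [0, 1].
    """
    return TF.gaussian_blur(mask, kernel_size=[kernel_size, kernel_size],
                            sigma=[sigma, sigma])


def progressive_feature_blend(edited_feats, original_feats, mask):
    """PFB step (illustrative): blend edited and original features at every level.

    edited_feats / original_feats: lists of (B, C_i, H_i, W_i) tensors.
    mask: (B, 1, H, W) soft mask, resized to each feature resolution.
    """
    blended = []
    for f_edit, f_orig in zip(edited_feats, original_feats):
        m = F.interpolate(mask, size=f_edit.shape[-2:],
                          mode="bilinear", align_corners=False)
        # Inside the mask keep the edited features; outside keep the original.
        blended.append(m * f_edit + (1.0 - m) * f_orig)
    return blended


# Toy usage with two feature levels.
mask = torch.zeros(1, 1, 64, 64)
mask[:, :, 16:48, 16:48] = 1.0                 # region to edit
soft_mask = soften_mask(mask)                  # GB: smooth the mask boundary
edited = [torch.randn(1, 320, 32, 32), torch.randn(1, 640, 16, 16)]
original = [torch.randn(1, 320, 32, 32), torch.randn(1, 640, 16, 16)]
print([f.shape for f in progressive_feature_blend(edited, original, soft_mask)])
```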
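In the same spirit, the sketch below illustrates the CAM idea: selected text tokens are prevented from influencing pixels outside the edit region by adding a large negative value to the disallowed cross-attention scores before the softmax. The function name, tensor shapes, and masking scheme are assumptions chosen for illustration rather than the notebook's actual API.

```python
import torch


def masked_cross_attention(q, k, v, region_mask, token_mask, neg=-1e9):
    """Cross-attention in which selected text tokens only influence a given region.

    q:           (B, N_pix, D) image-query tokens (N_pix = H * W)
    k, v:        (B, N_txt, D) text keys / values
    region_mask: (B, N_pix)    1 inside the editable region, 0 outside
    token_mask:  (B, N_txt)    1 for text tokens whose influence is restricted
    """
    scale = q.shape[-1] ** -0.5
    scores = torch.einsum("bqd,bkd->bqk", q, k) * scale        # (B, N_pix, N_txt)

    # Block pixels outside the region from attending to the restricted tokens.
    disallowed = (1.0 - region_mask)[:, :, None] * token_mask[:, None, :]
    scores = scores + neg * disallowed

    attn = scores.softmax(dim=-1)
    return torch.einsum("bqk,bkd->bqd", attn, v)


# Toy usage: restrict the last two prompt tokens to a 4x4 patch of an 8x8 map.
q = torch.randn(1, 64, 32)
k = torch.randn(1, 6, 32)
v = torch.randn(1, 6, 32)
region = torch.zeros(1, 1, 8, 8)
region[:, :, 2:6, 2:6] = 1.0
region_mask = region.flatten(1)                                # (1, 64)
token_mask = torch.tensor([[0.0, 0.0, 0.0, 0.0, 1.0, 1.0]])
print(masked_cross_attention(q, k, v, region_mask, token_mask).shape)
```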
Extensive experiments and analyses demonstrate that our enhanced MGIE framework outperforms previous methods in terms of visual quality, semantic alignment, and faithfulness to the original image.
For a deep dive into our methodology, experiments, and results, check out our paper, "Enhancing MLLM-Guided Image Editing with Progressive Feature Blending and Cross-Attention Masking." The paper is available in the `paper/` directory and offers comprehensive insights into our techniques and their contributions.

Explore the implementation of the enhanced MGIE framework in the `code/` directory. Our primary implementation is provided in the `mgie_implementation.ipynb` Jupyter notebook, which includes step-by-step instructions for training and testing the framework on various datasets. Ensure you have all dependencies listed in the `requirements.txt` file.
Check out our `results/` directory to see sample input images and their corresponding edited outputs generated by the enhanced MGIE framework. The `input_images/` subdirectory contains the original images, while the `output_images/` subdirectory showcases the edited versions, highlighting the effectiveness of our framework.
Ready to dive in? Follow these steps to set up and start using the enhanced MGIE framework.
- Python 3.6 or above
- PyTorch 1.9 or above
- CUDA 11.0 or above (for GPU acceleration)
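You can quickly sanity-check these prerequisites with a short snippet like the one below (illustrative only; not part of the repository):

```python
import sys
import torch

print("Python:", sys.version.split()[0])          # expect 3.6 or above
print("PyTorch:", torch.__version__)              # expect 1.9 or above
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA version:", torch.version.cuda)    # expect 11.0 or above
```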
- Clone the repository:
  `git clone https://github.com/your-username/ml-mgie-implementation.git`
  `cd ml-mgie-implementation`
- Install the required dependencies:
  `pip install -r code/requirements.txt`
- Download the pretrained models and datasets:
  - Place the pretrained MLLM model (e.g., LLaVA-7B) in the `models/` directory.
  - Place the desired datasets (e.g., COCO, CUB, Oxford-102 Flowers) in the `data/` directory.
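For reference, the directory layout assumed by the steps above (and by the rest of this README) looks roughly like this:

```
ml-mgie-implementation/
├── code/
│   ├── mgie_implementation.ipynb
│   └── requirements.txt
├── data/          # datasets (e.g., COCO, CUB, Oxford-102 Flowers)
├── models/        # pretrained MLLM (e.g., LLaVA-7B)
├── paper/
└── results/
    ├── input_images/
    └── output_images/
```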
- Open the `code/mgie_implementation.ipynb` Jupyter notebook (Inference).
- Follow the instructions to train and test the framework on your chosen datasets.
- Modify the notebook as needed to experiment with different settings and hyperparameters.
- Provide input images, text prompts, and binary masks (if applicable) to generate edited images; a minimal input-preparation sketch follows this list.
- Find your edited images saved in the `results/output_images/` directory.
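As a starting point for the "provide input images, text prompts, and binary masks" step above, here is a minimal, illustrative sketch of loading those inputs as tensors; the file names are placeholders, and the exact format expected by the cells in `code/mgie_implementation.ipynb` may differ.

```python
from PIL import Image
import torchvision.transforms.functional as TF

# Placeholder file names; substitute your own image and edit mask.
image = Image.open("results/input_images/example.jpg").convert("RGB")
mask = Image.open("results/input_images/example_mask.png").convert("L")

image = TF.resize(TF.to_tensor(image), [512, 512]).unsqueeze(0)  # (1, 3, 512, 512)
mask = TF.resize(TF.to_tensor(mask), [512, 512]).unsqueeze(0)    # (1, 1, 512, 512)
mask = (mask > 0.5).float()        # binarize the edit mask after resizing

prompt = "make the sky look like a sunset"  # example text instruction
```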
We welcome contributions! If you encounter issues or have suggestions for improvement, please open an issue or submit a pull request. Adhere to the existing code style and provide detailed explanations of your changes.
We extend our gratitude to the original MGIE framework and PFB-Diff method authors for their foundational work in text-driven image editing. Thanks also to the open-source community for providing essential tools and libraries used in this implementation.