This repository demonstrates various multimodal AI capabilities using IBM's Granite Vision model, a vision-language model that can understand and process both images and text.
IBM Granite Vision can analyze images and respond to text prompts about them. This repository showcases several practical use cases:
- OCR (Optical Character Recognition): Extract text from documents, images, and scanned materials.
- HTML Generation: Convert webpage screenshots to functional HTML code.
- Flowchart Analysis: Generate descriptions or mermaid diagrams from flowchart images.
- Code Generation: Create code from class diagrams and other visual representations.
The repository includes a web application that provides an interactive interface to explore these capabilities:
- Image upload functionality
- Task selection
- Custom prompt options
- Formatted result display
- Download capabilities
- Python 3.7+
- IBM Cloud account with API key
- Project ID for IBM Granite Vision
- Clone this repository:

  ```
  git clone <repository-url>
  cd <repository-directory>
  ```

- Install the required dependencies:

  ```
  pip install -r requirements.txt
  ```

- Create a `.env` file in the root directory with the following content:

  ```
  API_KEY="your-ibm-cloud-api-key"
  PROJECT_ID="your-project-id"
  IAM_IBM_CLOUD_URL=iam.cloud.ibm.com
  ```
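The app reads these values at startup. As a rough illustration of what that involves, here is a minimal stdlib-only `.env` reader; the actual app may use a library such as python-dotenv instead, and the `load_env` helper name is hypothetical:

```python
from pathlib import Path

def load_env(path: str = ".env") -> dict:
    """Minimal .env reader: KEY=VALUE lines, surrounding quotes stripped,
    blank lines and #-comments ignored."""
    values = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"')
    return values
```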
Run the Flask app with:
python app.py
The application will be available at http://localhost:5000 in your web browser.
The Flask application (app.py) provides a responsive web interface with Bootstrap styling to interact with the IBM Granite Vision model.
Before using the app, you need to authenticate with IBM Cloud:
- Click the "Authenticate with IBM Cloud" button in the sidebar
- The app will use your API key from the `.env` file to obtain an access token
- Once authenticated, the status will change to show success
If your token expires during use, you can click the "Refresh Token" button to obtain a new one.
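The token exchange behind these buttons is a standard IBM Cloud IAM request: the API key is POSTed to the IAM `identity/token` endpoint, which returns a short-lived access token. A stdlib-only sketch (the `get_iam_token` helper name is an assumption, not the app's actual function):

```python
import json
import urllib.parse
import urllib.request

IAM_HOST = "iam.cloud.ibm.com"  # matches IAM_IBM_CLOUD_URL in .env

def build_token_request(api_key: str, iam_host: str = IAM_HOST):
    """Build the (url, body) pair for the IAM token exchange."""
    url = f"https://{iam_host}/identity/token"
    body = urllib.parse.urlencode({
        "grant_type": "urn:ibm:params:oauth:grant-type:apikey",
        "apikey": api_key,
    }).encode()
    return url, body

def get_iam_token(api_key: str) -> str:
    """Exchange an IBM Cloud API key for a short-lived IAM access token."""
    url, body = build_token_request(api_key)
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())["access_token"]
```

When the token expires, calling `get_iam_token` again with the same API key yields a fresh one, which is all the "Refresh Token" button needs to do.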
The application has a responsive layout:
- Task Selection: Choose from a dropdown of the four main tasks:
  - OCR: Extract text from documents
  - HTML Generation: Convert webpage screenshots to HTML
  - Flowchart Analysis: Generate descriptions from diagrams
  - Code Generation: Create code from class diagrams
- Custom Prompt: Optionally enter a custom prompt to guide the model's response. If left empty, a default prompt is used based on the selected task.
- Image Upload: Upload an image in PNG, JPG, or JPEG format with real-time preview.
- Process Button: Click "Process Image" to send the image to the IBM Granite Vision model.
The results are displayed with appropriate formatting:
- OCR Results: Rendered as markdown with proper formatting
- HTML Generation: Displayed as syntax-highlighted code with a "Render HTML" button to see the actual rendered webpage
- Flowchart Analysis: Displayed with mermaid diagrams rendered inline if present
- Code Generation: Code blocks are syntax-highlighted and can be viewed separately
- Real-time Image Preview: See a preview of your uploaded image before processing
- Syntax Highlighting: Using highlight.js for code blocks
- Markdown Rendering: Using marked.js for text formatting
- Mermaid Diagram Support: Automatic rendering of mermaid diagrams
- Responsive Design: Works well on desktop and mobile devices
- Download Option: Save results as text files
If you prefer to use the command line instead of the web interface, you can use the image_to_text_granite_vision.py script directly:
python image_to_text_granite_vision.py
This will process the example images in the examples directory and display the results in the terminal.
The examples directory contains sample images for each task:
- `ocr_document.png`: A document for OCR testing
- `webpage.png`: A screenshot of a webpage for HTML generation
- `flowchart.png`: A flowchart diagram for analysis
- `classdiagram.png`: A class diagram for code generation
Extract text from documents, images, and scanned materials. The model can recognize text in various fonts, layouts, and styles.
Convert webpage screenshots to functional HTML code. The model analyzes the visual layout and generates corresponding HTML structure.
Generate descriptions or mermaid diagrams from flowchart images. The model understands the structure and relationships in diagrams.
Create code from class diagrams and other visual representations. The model can interpret UML diagrams and generate corresponding code.
The application uses IBM Granite Vision, a multimodal AI model that can process both images and text. The model is accessed through IBM Cloud's API, which requires authentication with an API key.
The workflow is as follows:
- The image is encoded as a base64 string
- A prompt is constructed based on the selected task
- The image and prompt are sent to the IBM Granite Vision API
- The API returns a text response, which is then processed and displayed
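The first three steps can be sketched as a request-payload builder. The field layout below (`model_id`, `project_id`, `messages` with an `image_url` data URI) follows common chat-API conventions for watsonx.ai-style endpoints; the exact schema and the `build_chat_payload` helper are assumptions for illustration:

```python
import base64

MODEL_ID = "ibm/granite-vision-3-2-2b"

def build_chat_payload(image_bytes: bytes, prompt: str, project_id: str,
                       model_id: str = MODEL_ID) -> dict:
    """Assemble a chat request: base64-encoded image plus a text prompt
    in a single user message."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model_id": model_id,
        "project_id": project_id,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": prompt},
            ],
        }],
        "max_tokens": 1024,
    }
```

The resulting dict is sent as JSON with the IAM token in an `Authorization: Bearer ...` header, and the text in the API's response is what the app renders.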
- Model: IBM Granite Vision 3.2 2B (`ibm/granite-vision-3-2-2b`)
- API: IBM Cloud ML API for text/chat
- Authentication: IBM Cloud IAM token authentication
- Frontend: Flask with Bootstrap for a responsive web application
- Image Processing: PIL/Pillow for image handling
- JavaScript Libraries:
- marked.js for markdown rendering
- highlight.js for syntax highlighting
- mermaid.js for diagram rendering
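The PIL/Pillow image-handling step might look like the sketch below: validate that an upload is PNG or JPEG and re-encode it to PNG bytes ready for base64 encoding. The `normalize_upload` helper is a hypothetical stand-in, not the app's actual code:

```python
import io

from PIL import Image

ALLOWED_FORMATS = {"PNG", "JPEG"}  # .jpg files report their format as "JPEG"

def normalize_upload(raw: bytes) -> bytes:
    """Validate an uploaded image and re-encode it as PNG bytes."""
    img = Image.open(io.BytesIO(raw))
    if img.format not in ALLOWED_FORMATS:
        raise ValueError(f"Unsupported image format: {img.format}")
    out = io.BytesIO()
    img.convert("RGB").save(out, format="PNG")
    return out.getvalue()
```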
For production deployment, consider the following:
- Use a production WSGI server like Gunicorn or uWSGI instead of Flask's development server
- Set up proper error handling and logging
- Implement rate limiting to prevent API abuse
- Consider containerization with Docker for easier deployment
- Use environment variables for sensitive information
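For the first point, a typical Gunicorn invocation looks like the following; it assumes the Flask instance in `app.py` is named `app` (adjust `app:app` if not):

```shell
# Serve with 4 worker processes, bound to port 5000 on all interfaces.
gunicorn --workers 4 --bind 0.0.0.0:5000 app:app
```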
- If authentication fails, check your API key in the `.env` file.
- If image processing fails, try refreshing your token or check your internet connection.
- For large images, processing may take longer than expected.



