
[macOS] Add the ability to send images to the model, not just text - the model can now summarize and describe images, useful for working with screenshots #104

Open
Joaov41 opened this issue Jan 18, 2025 · 21 comments


@Joaov41

Joaov41 commented Jan 18, 2025

Image detection and storage logic is inserted into the showPopup() method. This block is responsible for checking the clipboard for image data and storing it.

let supportedTypes: [NSPasteboard.PasteboardType] = [.tiff, .png, .pdf, .fileURL]
This array lists the clipboard types the code considers as image data.

for type in supportedTypes {
    if let data = pasteboard.data(forType: type) {
        print("[DEBUG] Found image data of size \(data.count) bytes for type: \(type.rawValue)")
        foundImages.append(data)
    }
}

This loop iterates over each supported type; if data for that type is found on the clipboard, it prints a debug message and appends the data to the foundImages array.

self.appState.selectedImages = foundImages
After detecting any images, this line saves the collected image data into the selectedImages property of AppState.

The code block appears after the simulation of Cmd+C and after retrieving any text from the clipboard.
It comes right before showing the popup window, ensuring that any detected images are stored in AppState for later use.
The full method:

// MARK: - SHOW POPUP (Image handling added)

private func showPopup() {
    DispatchQueue.main.async { [weak self] in
        guard let self = self else { return }

        self.appState.geminiProvider.cancel()
        self.closePopupWindow()

        let pasteboard = NSPasteboard.general
        let oldContents = pasteboard.string(forType: .string) ?? ""

        // Simulate Cmd+C
        let source = CGEventSource(stateID: .hidSystemState)
        let keyDown = CGEvent(keyboardEventSource: source, virtualKey: 0x08, keyDown: true)
        let keyUp = CGEvent(keyboardEventSource: source, virtualKey: 0x08, keyDown: false)
        keyDown?.flags = .maskCommand
        keyUp?.flags = .maskCommand
        keyDown?.post(tap: .cghidEventTap)
        keyUp?.post(tap: .cghidEventTap)

        DispatchQueue.main.asyncAfter(deadline: .now() + 0.2) { [weak self] in
            guard let self = self else { return }
            let selectedText = pasteboard.string(forType: .string) ?? ""

            // If no new text was selected, fallback to oldContents
            let textToProcess = selectedText.isEmpty ? oldContents : selectedText
            self.appState.selectedText = textToProcess

            // NEW: Attempt to detect images on the clipboard
            var foundImages: [Data] = []
            let supportedTypes: [NSPasteboard.PasteboardType] = [.tiff, .png, .pdf, .fileURL]
            
            // Debug: print all pasteboard types
            let allTypes = pasteboard.types ?? []
            print("[DEBUG] Pasteboard types: \(allTypes.map(\.rawValue))")
            
            for type in supportedTypes {
                if let data = pasteboard.data(forType: type) {
                    print("[DEBUG] Found image data of size \(data.count) bytes for type: \(type.rawValue)")
                    foundImages.append(data)
                }
            }
            
            // Store images in AppState
            self.appState.selectedImages = foundImages

            // Show the popup window
            let window = PopupWindow(appState: self.appState)
            window.delegate = self
            self.popupWindow = window

            window.positionNearMouse()
            window.makeKeyAndOrderFront(nil)
            window.orderFrontRegardless()
        }
    }
}
@Aryamirsepasi
Collaborator

Hi, thanks for using the app. This is a very interesting idea! I'll definitely look into it and add it in a future update. Thanks!

@Joaov41
Author

Joaov41 commented Jan 19, 2025

I have already implemented this in the version running on my machine. If you want, I can send it to you to check. I also modified the copy function so that it copies the entire conversation, not just the latest message; for my personal use case that is better.

@Aryamirsepasi
Collaborator

That's great news! If you want, you can create a fork of the main project and push your changes there. I can then check them out.

@Joaov41
Author

Joaov41 commented Jan 22, 2025

Done
https://github.com/Joaov41/WritingTools_mac_vission/

I have tested it thoroughly with Gemini Flash 2.0 and it works very well. The only problem is with the Outlook app: when forwarding or replying to an email, for some reason, the text is not detected. In the emails themselves it works just fine. I also tried all the other apps I use and it works fine there too, so something about the emails in the Outlook app is interfering. Still investigating.

@Joaov41
Author

Joaov41 commented Jan 23, 2025

Fixed the problem with the Outlook app: images are now detected in all apps, and text is no longer recognized as an image, which is what caused the error in Outlook.
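For reference, here is a minimal sketch of one way such a misclassification can be avoided (the actual fix in my fork may differ in detail): raw bitmap types are accepted directly, but .fileURL entries are only treated as images when the file genuinely conforms to an image type.

import AppKit
import UniformTypeIdentifiers

func imageData(from pasteboard: NSPasteboard) -> [Data] {
    var images: [Data] = []

    // Raw bitmap types can be treated as image data directly.
    for type in [NSPasteboard.PasteboardType.tiff, .png] {
        if let data = pasteboard.data(forType: type) {
            images.append(data)
        }
    }

    // For file URLs, verify the file really is an image before reading it,
    // so copied text files or other documents are not misclassified.
    if let urls = pasteboard.readObjects(forClasses: [NSURL.self], options: nil) as? [URL] {
        for url in urls {
            guard let fileType = UTType(filenameExtension: url.pathExtension),
                  fileType.conforms(to: .image),
                  let data = try? Data(contentsOf: url) else { continue }
            images.append(data)
        }
    }
    return images
}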

@alexxthekidd

@Joaov41 how do we run it?

@Aryamirsepasi
Collaborator


Thank you! I'll review it today and include your contribution in the next version.

@Joaov41
Author

Joaov41 commented Jan 23, 2025

I pushed another update. In the previous version, images would only work with "Describe your changes", which meant users had to type what they wanted the LLM to do with the image. I have now incorporated the image capability into the regular options menu (Summarize, Key Points, etc.), so the code treats the image as if it were regular text. Of course this works best with images that contain text, but I have tested with images without text and, at least with Gemini, it works very well: if there is no text, the LLM just describes the image according to the requested option. It even creates a nice table with the description of the image. Very impressive.
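To illustrate what treating the image like regular text looks like on the wire, here is a hedged sketch of a Gemini generateContent request that pairs an option's prompt (e.g. Summarize) with the copied image. The model name is a placeholder, and the app's GeminiProvider presumably wraps an equivalent call.

import Foundation

func makeImageRequest(prompt: String, imageData: Data, apiKey: String) -> URLRequest? {
    let model = "gemini-2.0-flash" // placeholder model name
    guard let url = URL(string:
        "https://generativelanguage.googleapis.com/v1beta/models/\(model):generateContent?key=\(apiKey)")
    else { return nil }

    // Gemini accepts multimodal input as an array of parts: a text part for
    // the option's instruction plus an inline base64-encoded image.
    let parts: [[String: Any]] = [
        ["text": prompt],
        ["inline_data": [
            "mime_type": "image/png",
            "data": imageData.base64EncodedString()
        ]]
    ]
    let body: [String: Any] = ["contents": [["parts": parts]]]

    var request = URLRequest(url: url)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = try? JSONSerialization.data(withJSONObject: body)
    return request
}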

@Aryamirsepasi
Collaborator

Aryamirsepasi commented Jan 23, 2025

Thanks! I've integrated your excellent code into the upcoming release, likely by the end of the week. I've also credited you as a macOS version contributor in the next update.

The next version so far includes:

  • Image Processing via Gemini (Thanks to @Joaov41)
  • Direct Mistral Support
  • Provider selection in Onboarding Window
  • Closing the Popup via ESC Key

I've also been working on adding German, Spanish, French, and Russian languages to the app, but that might be available in version 3 and not 2.

@Joaov41
Author

Joaov41 commented Feb 4, 2025

I have been working on another update. A URL can now be copied to the clipboard through the share extension; when the shortcut is invoked, the code checks whether there is a URL on the clipboard, and if so it extracts the page's content, which can then be used as usual with Summary, Key Points, or any other option or custom option.
Very useful for my use case, along with the image support already implemented. If you want to check it out, I updated my fork. At first I tried the share sheet, but it would not work with all apps; macOS is finicky. I pushed this upgrade to the iOS version as well.
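A minimal sketch of this clipboard-URL path, assuming the URL arrives as plain text and that a quick HTML-to-plain-text conversion is good enough (the extractor in my fork may be more involved):

import AppKit

@MainActor // NSAttributedString's HTML importer is WebKit-backed and main-thread only
func extractClipboardURLContent() async -> String? {
    guard let raw = NSPasteboard.general.string(forType: .string),
          let url = URL(string: raw.trimmingCharacters(in: .whitespacesAndNewlines)),
          url.scheme == "http" || url.scheme == "https"
    else { return nil }

    // Fetch the page, then strip the HTML down to readable text.
    guard let (data, _) = try? await URLSession.shared.data(from: url) else { return nil }
    let options: [NSAttributedString.DocumentReadingOptionKey: Any] = [
        .documentType: NSAttributedString.DocumentType.html,
        .characterEncoding: String.Encoding.utf8.rawValue
    ]
    return (try? NSAttributedString(data: data, options: options,
                                    documentAttributes: nil))?.string
}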

@Aryamirsepasi
Collaborator


Great work! This could be very helpful and significantly increase the app's usefulness. I haven't had time to review your code yet, but I'll add the feature to the TODO list for version 4. Since version 3 is nearly complete, it would be more efficient to include this new capability in the next version.

@alexxthekidd


May I ask what is to be expected in v4?

@alexxthekidd


You mean it will extract the content of the URL, acting as a scraper?

@Aryamirsepasi
Collaborator


May I ask what is to be expected in v4?

Sure!
This is the current roadmap:

V3 (Arriving This Week):

  • Integrated local LLM (Llama 3.2 3B, so that it can run on more devices and be fast for text rewriting tasks; I might add support for more LLMs in the future, since it is really easy via MLX).
  • Image and PDF OCR support across all LLMs. The image part will work like the current Gemini implementation but will be less accurate, since it only extracts the text from images. A new attach button is added to the response window for asking follow-up questions about a subject with a PDF or image attached.
  • Closing the popup via ESC Key
  • More compact Popup design
  • Bug fixes on Popup opening

V4 (TBA):

  • Joaov41’s new URL scraping
  • Localization (Finally:D)
  • Text streaming in response window
  • Ability to exclude specific apps from triggering the popup shortcut.

Concepts for the future (Some might be added to V4 or later versions):

  • Ability to ask questions upon selecting a pdf or image in Finder and opening the popup on them. This is merely a concept for now but I am working on it.
  • Local LLM Store (a place to download local LLMs and use them in the app)
  • Web search in the response window (DuckDuckGo)
  • Deep Research (this is currently a separate project I am working on; if it works and I can get good results via local LLMs, I might integrate it, but it might also make the app too complicated).

@Joaov41
Author

Joaov41 commented Feb 4, 2025


You mean it will extract the content of the URL, acting as a scraper?

Exactly.

@Joaov41
Author

Joaov41 commented Feb 6, 2025

You mentioned PDF support. I had already been working on PDF support when I started on the image support and URL scraping, but it was not working yet. I think I have it now, although I followed a different approach. It is not through Finder: the user can simply use the context menu on the PDF file (right click - Copy), and the app will extract the text from the PDF and send it to the LLM for use with the usual options. So this version includes image support, URL scraping, and PDF, all through the clipboard. I have updated my fork in case you want to check it out.
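A sketch of that PDF path, assuming a file URL lands on the clipboard after right click - Copy; PDFKit's PDFDocument exposes the document's plain text directly:

import AppKit
import PDFKit

func extractClipboardPDFText() -> String? {
    guard let urls = NSPasteboard.general.readObjects(forClasses: [NSURL.self],
                                                      options: nil) as? [URL],
          let pdfURL = urls.first(where: { $0.pathExtension.lowercased() == "pdf" }),
          let document = PDFDocument(url: pdfURL)
    else { return nil }
    return document.string // concatenated text of all pages
}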

@Joaov41
Author

Joaov41 commented Feb 7, 2025

Another update: I use a lot of video screenshots in my daily workflow, and since the Gemini models support video, why not add video functionality to the app? It will be very useful for my use case, so I have updated the code. Gemini only, of course.
Now the app can handle text, images, PDFs, URLs, and video. Same flow as with PDFs or URLs: use Copy in the file's context menu, then just use the shortcut to invoke the options menu.
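A hedged sketch of how a copied video file could be recognized and given the MIME type Gemini expects; small clips can be sent inline the same way as images, while larger files would need Gemini's File API instead:

import AppKit
import UniformTypeIdentifiers

func clipboardVideo() -> (data: Data, mimeType: String)? {
    guard let urls = NSPasteboard.general.readObjects(forClasses: [NSURL.self],
                                                      options: nil) as? [URL],
          let url = urls.first,
          let fileType = UTType(filenameExtension: url.pathExtension),
          fileType.conforms(to: .movie),          // only accept real video files
          let mime = fileType.preferredMIMEType,  // e.g. "video/mp4"
          let data = try? Data(contentsOf: url)
    else { return nil }
    return (data, mime)
}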


@theJayTea
Owner

@Joaov41 , that's some really cool work :)

Here's a small suggestion, let me know what you think!
When the Writing Tools UI is invoked and an image/video/URL/PDF is detected, it may be worth adding a checkbox such as "Include Copied Image", "Include Copied Video", "Read Copied URL", or "Read Copied PDF" (depending on what you found).

This would solve 2 things:

  1. Most users have old stuff copied to their clipboard all the time — after taking a screenshot, that goes to the clipboard, and after sharing a URL with a friend, that'd remain in the clipboard. It's not too often that the average user would want the LLM in Writing Tools to read/look at that content.

  2. It would make things more intuitive. Right now, only Gemini supports video & images, and in the future if you decide to add it to OpenAI, only OpenAI and Gemini would support images, etc.
    If that checkbox shows up only when using Gemini or another multimodal model that supports the content, then it's clear to the user when the clipboard-content reading feature is active and when it's not:
    When there's no checkbox showing up, it's clear that the feature isn't present.

Again, awesome work! :]
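A rough SwiftUI sketch of the idea (assuming AppState is an ObservableObject; providerSupportsImages and attachmentLabel are hypothetical names, not existing code):

import SwiftUI

struct AttachmentToggle: View {
    @ObservedObject var appState: AppState
    @Binding var includeAttachment: Bool

    var body: some View {
        // Rendered only when media was detected AND the active provider is
        // multimodal, so the checkbox's absence signals the feature is off.
        if !appState.selectedImages.isEmpty && appState.providerSupportsImages {
            Toggle(appState.attachmentLabel, // e.g. "Include Copied Image"
                   isOn: $includeAttachment)
                .toggleStyle(.checkbox)      // macOS checkbox appearance
        }
    }
}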

@Joaov41
Author

Joaov41 commented Feb 8, 2025

Thank you for your kind words, and thank you for your awesome suggestions; they are great. The code already clears the clipboard after every request, so that very valid concern applies only when the app starts and the user already has something on the clipboard.
I will implement your suggestions and also clear the clipboard when the app starts, while keeping the current behavior of clearing it after each request.
Thanks again!
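The startup clearing is essentially a one-liner in the app delegate; a minimal sketch:

import AppKit

final class AppDelegate: NSObject, NSApplicationDelegate {
    func applicationDidFinishLaunching(_ notification: Notification) {
        // Clear any stale clipboard content at launch, in addition to the
        // existing clear after each request.
        NSPasteboard.general.clearContents()
    }
}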

@Aryamirsepasi
Collaborator

Hello everyone, I regret to announce that v3 will not be released this week. I'm currently occupied with my exams. I've decided to postpone the update until after my exams are over and then merge v3 and v4 into a single, comprehensive update.

Sorry, and thank you for your understanding.

@theJayTea
Owner

All the best with your exams, Arya! No worries at all.
