
[macOS] Add the ability to send images to the model, not just text - the model can now summarize and describe images, useful for working with screenshots #104

Open
Joaov41 opened this issue Jan 18, 2025 · 21 comments


@Joaov41

Joaov41 commented Jan 18, 2025

Image detection and storage logic is inserted into the showPopup() method. This block is responsible for checking the clipboard for image data and storing it.

let supportedTypes: [NSPasteboard.PasteboardType] = [.tiff, .png, .pdf, .fileURL]
This array lists the clipboard types the code considers as image data.

for type in supportedTypes {
    if let data = pasteboard.data(forType: type) {
        print("[DEBUG] Found image data of size \(data.count) bytes for type: \(type.rawValue)")
        foundImages.append(data)
    }
}

This loop iterates over each supported type; if data for that type is found on the clipboard, it prints a debug message and appends the data to the foundImages array.

self.appState.selectedImages = foundImages
After detecting any images, this line saves the collected image data into the selectedImages property of AppState.

The code block appears after the simulation of Cmd+C and after retrieving any text from the clipboard.
It comes right before showing the popup window, ensuring that any detected images are stored in AppState for later use.
The full method:

// MARK: - SHOW POPUP (Image handling added)

private func showPopup() {
    DispatchQueue.main.async { [weak self] in
        guard let self = self else { return }

        self.appState.geminiProvider.cancel()
        self.closePopupWindow()

        let pasteboard = NSPasteboard.general
        let oldContents = pasteboard.string(forType: .string) ?? ""

        // Simulate Cmd+C
        let source = CGEventSource(stateID: .hidSystemState)
        let keyDown = CGEvent(keyboardEventSource: source, virtualKey: 0x08, keyDown: true)
        let keyUp = CGEvent(keyboardEventSource: source, virtualKey: 0x08, keyDown: false)
        keyDown?.flags = .maskCommand
        keyUp?.flags = .maskCommand
        keyDown?.post(tap: .cghidEventTap)
        keyUp?.post(tap: .cghidEventTap)

        DispatchQueue.main.asyncAfter(deadline: .now() + 0.2) { [weak self] in
            guard let self = self else { return }
            let selectedText = pasteboard.string(forType: .string) ?? ""

            // If no new text was selected, fallback to oldContents
            let textToProcess = selectedText.isEmpty ? oldContents : selectedText
            self.appState.selectedText = textToProcess

            // NEW: Attempt to detect images on the clipboard
            var foundImages: [Data] = []
            let supportedTypes: [NSPasteboard.PasteboardType] = [.tiff, .png, .pdf, .fileURL]
            
            // Debug: print all pasteboard types
            let allTypes = pasteboard.types ?? []
            print("[DEBUG] Pasteboard types: \(allTypes.map(\.rawValue))")
            
            for type in supportedTypes {
                if let data = pasteboard.data(forType: type) {
                    print("[DEBUG] Found image data of size \(data.count) bytes for type: \(type.rawValue)")
                    foundImages.append(data)
                }
            }
            
            // Store images in AppState
            self.appState.selectedImages = foundImages

            // Show the popup window
            let window = PopupWindow(appState: self.appState)
            window.delegate = self
            self.popupWindow = window

            window.positionNearMouse()
            window.makeKeyAndOrderFront(nil)
            window.orderFrontRegardless()
        }
    }
}
@Aryamirsepasi
Collaborator

Hi, thanks for using the app. This is a very interesting idea! I'll definitely look into it and add it in a future update. Thanks!

@Joaov41
Author

Joaov41 commented Jan 19, 2025

I have already implemented this in the version running on my machine. If you want, I can send it to you to check. I also modified the copy function so that it copies the entire conversation, not just the latest message; for my personal use case that is better.

@Aryamirsepasi
Collaborator

That's great news! If you want, you can create a fork of the main project and push your changes there. I can then check them out.

@Joaov41
Author

Joaov41 commented Jan 22, 2025

Done
https://github.com/Joaov41/WritingTools_mac_vission/

I have tested it thoroughly with Gemini Flash 2.0 and it works very well. The only problem is with the Outlook app: when forwarding or replying to an email, for some reason, the text is not detected. In the emails themselves it works just fine. I also tried all the other apps I use and it works fine there too, so something about the emails in the Outlook app is interfering. Still investigating.

@Joaov41
Author

Joaov41 commented Jan 23, 2025

Fixed the problem with the Outlook app: images are now detected in all apps, and text is no longer recognized as an image, which is what caused the error in Outlook.
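For reference, here is a minimal sketch of one way such a misclassification can be avoided (the actual fix in my fork may differ in detail): raw bitmap types are accepted directly, but .fileURL entries are only treated as images when the file genuinely conforms to an image type.

import AppKit
import UniformTypeIdentifiers

func imageData(from pasteboard: NSPasteboard) -> [Data] {
    var images: [Data] = []

    // Raw bitmap types can be treated as image data directly.
    for type in [NSPasteboard.PasteboardType.tiff, .png] {
        if let data = pasteboard.data(forType: type) {
            images.append(data)
        }
    }

    // For file URLs, verify the file really is an image before reading it,
    // so copied text files or other documents are not misclassified.
    if let urls = pasteboard.readObjects(forClasses: [NSURL.self], options: nil) as? [URL] {
        for url in urls {
            guard let fileType = UTType(filenameExtension: url.pathExtension),
                  fileType.conforms(to: .image),
                  let data = try? Data(contentsOf: url) else { continue }
            images.append(data)
        }
    }
    return images
}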

@alexxthekidd

@Joaov41 how do we run it?

@Aryamirsepasi
Collaborator


Thank you! I'll review it today and include your contribution in the next version.

@Joaov41
Author

Joaov41 commented Jan 23, 2025

I pushed another update. In the previous version, images would only work with "Describe your changes", which meant users had to type what they wanted the LLM to do with the image. I have now incorporated the image capability into the regular options menu (Summarize, Key Points, etc.), so the code treats the image as if it were regular text. Of course this works best with images that contain text, but I have tested with images without text and, at least with Gemini, it works very well: if there is no text, the LLM just describes the image according to the requested option. It even creates a nice table with the description of the image. Very impressive.
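To illustrate what treating the image like regular text looks like on the wire, here is a hedged sketch of a Gemini generateContent request that pairs an option's prompt (e.g. Summarize) with the copied image. The model name is a placeholder, and the app's GeminiProvider presumably wraps an equivalent call.

import Foundation

func makeImageRequest(prompt: String, imageData: Data, apiKey: String) -> URLRequest? {
    let model = "gemini-2.0-flash" // placeholder model name
    guard let url = URL(string:
        "https://generativelanguage.googleapis.com/v1beta/models/\(model):generateContent?key=\(apiKey)")
    else { return nil }

    // Gemini accepts multimodal input as an array of parts: a text part for
    // the option's instruction plus an inline base64-encoded image.
    let parts: [[String: Any]] = [
        ["text": prompt],
        ["inline_data": [
            "mime_type": "image/png",
            "data": imageData.base64EncodedString()
        ]]
    ]
    let body: [String: Any] = ["contents": [["parts": parts]]]

    var request = URLRequest(url: url)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = try? JSONSerialization.data(withJSONObject: body)
    return request
}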

@Aryamirsepasi
Collaborator

Aryamirsepasi commented Jan 23, 2025

Thanks! I've integrated your excellent code into the upcoming release, likely by the end of the week. I've also credited you as a macOS version contributor in the next update.

The next version so far includes:

  • Image Processing via Gemini (Thanks to @Joaov41)
  • Direct Mistral Support
  • Provider selection in Onboarding Window
  • Closing the Popup via ESC Key

I've also been working on adding German, Spanish, French, and Russian languages to the app, but that might be available in version 3 and not 2.

@Joaov41
Author

Joaov41 commented Feb 4, 2025

I have been working on another update. A URL can now be copied to the clipboard through the share extension; when the shortcut is invoked, the code checks whether there is a URL on the clipboard, and if so it extracts the page's content, which can then be used as usual with Summary, Key Points, or any other option or custom option.
Very useful for my use case, along with the image support already implemented. If you want to check it out, I updated my fork. At first I tried the share sheet, but it would not work with all apps; macOS is finicky. I pushed this upgrade to the iOS version as well.
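A minimal sketch of this clipboard-URL path, assuming the URL arrives as plain text and that a quick HTML-to-plain-text conversion is good enough (the extractor in my fork may be more involved):

import AppKit

@MainActor // NSAttributedString's HTML importer is WebKit-backed and main-thread only
func extractClipboardURLContent() async -> String? {
    guard let raw = NSPasteboard.general.string(forType: .string),
          let url = URL(string: raw.trimmingCharacters(in: .whitespacesAndNewlines)),
          url.scheme == "http" || url.scheme == "https"
    else { return nil }

    // Fetch the page, then strip the HTML down to readable text.
    guard let (data, _) = try? await URLSession.shared.data(from: url) else { return nil }
    let options: [NSAttributedString.DocumentReadingOptionKey: Any] = [
        .documentType: NSAttributedString.DocumentType.html,
        .characterEncoding: String.Encoding.utf8.rawValue
    ]
    return (try? NSAttributedString(data: data, options: options,
                                    documentAttributes: nil))?.string
}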

@Aryamirsepasi
Collaborator


Great work! This could be very helpful and significantly increase the app's usefulness. I haven't had time to review your code yet, but I'll add the feature to the TODO list for version 4. Since version 3 is nearly complete, it would be more efficient to include this new capability in the next version.

@alexxthekidd


May I ask what is to be expected in v4?

@alexxthekidd


You mean it will extract the content of the URL, acting as a scraper?

@Aryamirsepasi
Collaborator


May I ask what is to be expected in v4?

Sure!
This is the current roadmap:

V3 (Arriving This Week):

  • Integrated local LLM (Llama 3.2 3B, so that it can run on more devices and be fast for text rewriting tasks; I might add support for more LLMs in the future, since it is really easy via MLX).
  • Image and PDF OCR support across all LLMs. The image part will work like the current Gemini implementation but will be less accurate, since it only extracts the text from images. A new attach button is added to the response window for asking follow-up questions about a subject with a PDF or image attached.
  • Closing the popup via ESC Key
  • More compact Popup design
  • Bug fixes on Popup opening

V4 (TBA):

  • Joaov41’s new URL scraping
  • Localization (Finally:D)
  • Text streaming in response window
  • Ability to exclude specific apps from triggering the popup shortcut.

Concepts for the future (Some might be added to V4 or later versions):

  • Ability to ask questions upon selecting a pdf or image in Finder and opening the popup on them. This is merely a concept for now but I am working on it.
  • Local LLM Store (a place to download local LLMs and use them in the app)
  • Web search in the response window (DuckDuckGo)
  • Deep Research (this is currently a separate project I am working on; if it works and I can get good results via local LLMs, I might integrate it, but it might also make the app too complicated).

@Joaov41
Author

Joaov41 commented Feb 4, 2025


You mean it will extract the content of the URL, acting as a scraper?

Exactly.

@Joaov41
Author

Joaov41 commented Feb 6, 2025

You mentioned PDF support. I had already been working on PDF support when I started on the image support and URL scraping, but it was not working yet. I think I have it now, although I followed a different approach. It is not through Finder: the user can simply use the context menu on the PDF file (right click - Copy), and the app will extract the text from the PDF and send it to the LLM for use with the usual options. So this version includes image support, URL scraping, and PDF, all through the clipboard. I have updated my fork in case you want to check it out.
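A sketch of that PDF path, assuming a file URL lands on the clipboard after right click - Copy; PDFKit's PDFDocument exposes the document's plain text directly:

import AppKit
import PDFKit

func extractClipboardPDFText() -> String? {
    guard let urls = NSPasteboard.general.readObjects(forClasses: [NSURL.self],
                                                      options: nil) as? [URL],
          let pdfURL = urls.first(where: { $0.pathExtension.lowercased() == "pdf" }),
          let document = PDFDocument(url: pdfURL)
    else { return nil }
    return document.string // concatenated text of all pages
}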

@Joaov41
Author

Joaov41 commented Feb 7, 2025

Another update: I use a lot of video screenshots in my daily workflow, and since the Gemini models support video, why not add video functionality to the app? It will be very useful for my use case, so I have updated the code. Gemini only, of course.
Now the app can handle text, images, PDFs, URLs, and video. Same flow as with PDFs or URLs: use Copy in the file's context menu, then just use the shortcut to invoke the options menu.
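A hedged sketch of how a copied video file could be recognized and given the MIME type Gemini expects; small clips can be sent inline the same way as images, while larger files would need Gemini's File API instead:

import AppKit
import UniformTypeIdentifiers

func clipboardVideo() -> (data: Data, mimeType: String)? {
    guard let urls = NSPasteboard.general.readObjects(forClasses: [NSURL.self],
                                                      options: nil) as? [URL],
          let url = urls.first,
          let fileType = UTType(filenameExtension: url.pathExtension),
          fileType.conforms(to: .movie),          // only accept real video files
          let mime = fileType.preferredMIMEType,  // e.g. "video/mp4"
          let data = try? Data(contentsOf: url)
    else { return nil }
    return (data, mime)
}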


@theJayTea
Owner

@Joaov41 , that's some really cool work :)

Here's a small suggestion, let me know what you think!
When the Writing Tools UI is invoked and an image/video/URL/PDF is detected, it may be worth adding a checkbox such as "Include Copied Image", "Include Copied Video", "Read Copied URL", or "Read Copied PDF" (depending on what you found).

This would solve 2 things:

  1. Most users have old stuff copied to their clipboard all the time — after taking a screenshot, that goes to the clipboard, and after sharing a URL with a friend, that'd remain in the clipboard. It's not too often that the average user would want the LLM in Writing Tools to read/look at that content.

  2. It would make things more intuitive. Right now, only Gemini supports video & images, and in the future if you decide to add it to OpenAI, only OpenAI and Gemini would support images, etc.
    If that checkbox shows up only when using Gemini or another multimodal model that supports the content, then it's clear to the user when the clipboard-content reading feature is active and when it's not:
    When there's no checkbox showing up, it's clear that the feature isn't present.

Again, awesome work! :]
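A rough SwiftUI sketch of the idea (assuming AppState is an ObservableObject; providerSupportsImages and attachmentLabel are hypothetical names, not existing code):

import SwiftUI

struct AttachmentToggle: View {
    @ObservedObject var appState: AppState
    @Binding var includeAttachment: Bool

    var body: some View {
        // Rendered only when media was detected AND the active provider is
        // multimodal, so the checkbox's absence signals the feature is off.
        if !appState.selectedImages.isEmpty && appState.providerSupportsImages {
            Toggle(appState.attachmentLabel, // e.g. "Include Copied Image"
                   isOn: $includeAttachment)
                .toggleStyle(.checkbox)      // macOS checkbox appearance
        }
    }
}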

@Joaov41
Author

Joaov41 commented Feb 8, 2025

Thank you for your kind words, and thank you for your awesome suggestions; they are great. The code already clears the clipboard after every request, so that very valid concern applies only when the app starts and the user already has something on the clipboard.
I will implement your suggestions and also clear the clipboard when the app starts, while keeping the current behavior of clearing it after each request.
Thanks again!
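The startup clearing is essentially a one-liner in the app delegate; a minimal sketch:

import AppKit

final class AppDelegate: NSObject, NSApplicationDelegate {
    func applicationDidFinishLaunching(_ notification: Notification) {
        // Clear any stale clipboard content at launch, in addition to the
        // existing clear after each request.
        NSPasteboard.general.clearContents()
    }
}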

@Aryamirsepasi
Collaborator

Hello everyone, I regret to announce that v3 will not be released this week. I'm currently occupied with my exams. I've decided to postpone the update until after my exams are over and then merge v3 and v4 into a single, comprehensive update.

Sorry, and thank you for your understanding.

@theJayTea
Owner

All the best with your exams, Arya! No worries at all.
