Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Selecting Icons #32

Open
llermaly opened this issue Jan 21, 2024 · 4 comments
Open

Selecting Icons #32

llermaly opened this issue Jan 21, 2024 · 4 comments

Comments

@llermaly
Copy link

llermaly commented Jan 21, 2024

Hi! ,

I'm trying to automate using the search bar on a list of unknown sites.

In most cases the bar is not visible but there is an icon I must click before to display the search bar.

This example, I want to detect and click the magnifying glass:

image

The problem is it shows this way in the text [ @ 18 ] so GPT can not pick it (I'm using the llamaindex agent)

The website is https://elastic.co

I read @asim-shrestha mentions GPT-V mode in another issue but I'm not sure on how activate that one, I'm following the docs without success.

Any advice? thanks

@asim-shrestha
Copy link
Contributor

Hey @llermaly, because Tarsier is typically for text parsing, we currently don't support icons. (Not sure how we'd best go about it in the future either)

For images it is quite straight forward. There is a page_to_image function in tarsier that will return the bytes of the image. Then you can pass that in as an image to a vision language model likeGPT-4-V. Let me know if that helps!

@asim-shrestha
Copy link
Contributor

If you still want to go the text approach, you can manually find out which of those elements may be related to a search icon (through image name, or some other tag in the html itself) and provide that information in the prompt as well

@tvatter
Copy link

tvatter commented Jul 11, 2024

@llermaly I'm trying a combination of text based extraction with Tarsier and direct html parsing. Using playwright, one can do something like

elements = []
for role in ["img", "button"]: # other interesting aria roles
  elements.extend(await page.get_by_role(role).all())

for element in elements:
    properties = await element.evaluate("""
      el => {
          const isClickableTag = ['A', 'BUTTON', 'INPUT', 'AREA', 'SELECT', 'TEXTAREA'].includes(el.tagName);
          const hasClickRole = el.getAttribute('role') === 'button' || el.getAttribute('role') === 'link';
          const hasPointerCursor = window.getComputedStyle(el).cursor === 'pointer';
                                        
          return {
              tagName: el.tagName,
              textContent: el.textContent.trim(),
              className: el.className,
              isClickableTag: isClickableTag,
              hasClickRole: hasClickRole,
              hasPointerCursor: hasPointerCursor,
              isClickable: isClickableTag || hasClickRole || hasPointerCursor
          };
      }
  """)

And you can then use isClickable to figure out the clickable elements.

@asim-shrestha
Copy link
Contributor

Cool stuff! We might expose that information directly as a part of Tarsier @tvatter

Happy to take a PR if that would be of interest to you

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants