Show HN: Tarsier – vision for text-only LLM web agents that beats GPT-4o

Tarsier Monkey

🙈 Vision utilities for web interaction agents 🙈


🔗 Main site  •  🐦 Twitter  •  📢 Discord

Tarsier

If you've tried using an LLM to automate web interactions, you've probably run into questions like:

  • How should you feed the webpage to an LLM? (e.g. HTML, Accessibility Tree, Screenshot)
  • How do you map LLM responses back to web elements?
  • How can you inform a text-only LLM about the page's visual structure?

At Reworkd, we iterated on all these problems across tens of thousands of real web tasks to build a powerful perception system for web agents… Tarsier!
In the video below, we use Tarsier to provide webpage perception for a minimalistic GPT-4 LangChain web agent.

tarsier.mp4


How does it work?

Tarsier visually tags interactable elements on a page with brackets + an ID, e.g. [23].
In doing this, we provide a mapping between elements and IDs for an LLM to take actions upon (e.g. CLICK [23]).
We define interactable elements as buttons, links, or input fields that are visible on the page; Tarsier can also tag all textual elements if you pass tag_text_elements=True.
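To make the element-to-ID mapping concrete, here is a minimal sketch of how an agent loop might turn an LLM reply such as CLICK [23] back into an XPath. The action grammar (CLICK and TYPE), the `resolve_action` helper, and the sample mapping are all hypothetical and chosen for illustration; the sketch also assumes the mapping is keyed by integer tag IDs.

```python
import re

# Hypothetical action grammar: a verb, a bracketed tag ID, and optional text.
ACTION_RE = re.compile(r"^(CLICK|TYPE)\s+\[(\d+)\](?:\s+(.*))?$")


def resolve_action(llm_output, tag_to_xpath):
    """Turn an LLM action string into a (verb, xpath, text) tuple."""
    match = ACTION_RE.match(llm_output.strip())
    if match is None:
        raise ValueError(f"Unrecognized action: {llm_output!r}")
    verb, tag_id, text = match.group(1), int(match.group(2)), match.group(3)
    if tag_id not in tag_to_xpath:
        raise KeyError(f"No element tagged [{tag_id}]")
    return verb, tag_to_xpath[tag_id], text


# Made-up mapping for illustration (assumes integer tag IDs as keys).
tag_to_xpath = {
    23: "//html/body/form/input[1]",
    24: "//html/body/form/button[1]",
}

print(resolve_action("CLICK [24]", tag_to_xpath))
print(resolve_action("TYPE [23] hello world", tag_to_xpath))
```

Rejecting unknown verbs and unknown IDs up front keeps a hallucinated tag from silently turning into a click on the wrong element.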

Furthermore, we've developed an OCR algorithm to convert a page screenshot into a whitespace-structured string (almost like ASCII art) that an LLM even without vision can understand.
Since current vision-language models still lack the fine-grained representations needed for web interaction tasks, this is critical.
On our internal benchmarks, unimodal GPT-4 + Tarsier-Text beats GPT-4V + Tarsier-Screenshot by 10-20%!
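The idea of a whitespace-structured string can be illustrated with a toy example. The sketch below is not Tarsier's algorithm; it simply assumes OCR has already produced words with pixel coordinates and scales those onto a character grid. The `layout_words` function, the grid size, and the sample words are all made up for illustration.

```python
def layout_words(words, page_width, page_height, cols=60, rows=8):
    """Place each (text, x, y) word on a character grid by its pixel position."""
    grid = [[" "] * cols for _ in range(rows)]
    for text, x, y in words:
        # Scale pixel coordinates down to grid cells, clamped to the grid.
        row = min(int(y * rows // page_height), rows - 1)
        col = min(int(x * cols // page_width), cols - 1)
        for offset, ch in enumerate(text):
            if col + offset < cols:
                grid[row][col + offset] = ch
    # Trailing spaces carry no layout information, so strip them.
    return "\n".join("".join(line).rstrip() for line in grid)


# Made-up OCR output: (text, x, y) in pixels on a 1000x600 page.
words = [
    ("[1] Home", 0, 0),
    ("[2] Login", 800, 0),
    ("Welcome", 400, 300),
    ("[3] Search", 400, 500),
]

print(layout_words(words, page_width=1000, page_height=600))
```

Even in this crude form, the output keeps the navigation links on one line at opposite edges and the search control below the heading, which is the spatial information a text-only model would otherwise lose.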

[Figure: Tagged Screenshot (left) and Tagged Text Representation (right)]

Installation

Usage

Visit our cookbook for agent examples using Tarsier:

Otherwise, basic Tarsier usage might look like the following:

import asyncio
import json

from playwright.async_api import async_playwright
from tarsier import Tarsier, GoogleVisionOCRService

def load_google_cloud_credentials(json_file_path):
    with open(json_file_path) as f:
        credentials = json.load(f)
    return credentials

async def main():
    # To create the service account key, follow the instructions on this SO answer https://stackoverflow.com/a/46290808/1780891
    google_cloud_credentials = load_google_cloud_credentials('./google_service_acc_key.json')

    ocr_service = GoogleVisionOCRService(google_cloud_credentials)
    tarsier = Tarsier(ocr_service)

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto("https://news.ycombinator.com")

        page_text, tag_to_xpath = await tarsier.page_to_text(page)

        print(tag_to_xpath)  # Mapping of tags to XPaths
        print(page_text)  # Text representation of the page


if __name__ == '__main__':
    asyncio.run(main())
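Once you have the page text, it still has to reach the model. The following is a minimal sketch of wrapping it in a prompt for a text-only LLM; the template wording, the `build_prompt` helper, and the sample page text are hypothetical and not part of Tarsier.

```python
# Hypothetical prompt template; adjust the wording and action set to your agent.
PROMPT_TEMPLATE = """You are a web agent. Elements you can act on are tagged with an ID in brackets, e.g. [23].

Objective: {objective}

Current page:
{page_text}

Reply with exactly one action, such as CLICK [23] or TYPE [23] some text."""


def build_prompt(page_text, objective):
    """Combine the page's text representation and the task into one prompt."""
    return PROMPT_TEMPLATE.format(objective=objective, page_text=page_text)


# Made-up page text standing in for the string returned by page_to_text.
sample_page_text = "[1] Hacker News    [2] new    [3] past\n\n[4] login"
print(build_prompt(sample_page_text, "Open the login page"))
```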

Keep in mind that Tarsier tags different types of elements differently to help your LLM identify what actions are performable on each element. Specifically:

Local Development

Setup

We have provided a convenient setup script to get you up and running with Tarsier development.

If you modify any TypeScript files used by Tarsier, you'll need to execute the following command.
This compiles the TypeScript into JavaScript, which can then be used in the Python package.

Testing

We use pytest for testing. To run the tests, simply run:

Linting

Prior to submitting a potential PR, please run the following to format your code:

Supported OCR Services

Roadmap

  • Add documentation and examples
  • Clean up interfaces and add unit tests
  • Launch
  • Improve OCR text performance
  • Add options to customize tagging styling
  • Add support for other browser drivers as necessary

Citations

@misc{reworkd2023tarsier,
  title        = {Tarsier},
  author       = {Rohan Pandey and Adam Watkins and Asim Shrestha and Srijan Subedi},
  year         = {2023},
  howpublished = {GitHub},
  url          = {https://github.com/reworkd/tarsier}
}
