🙈 Vision utilities for web interaction agents 🙈
🔗 Main site • 🐦 Twitter • 📢 Discord
Tarsier
If you've tried using an LLM to automate web interactions, you've probably run into questions like:
- How should you feed the webpage to an LLM? (e.g. HTML, Accessibility Tree, Screenshot)
- How do you map LLM responses back to web elements?
- How can you inform a text-only LLM about the page's visual structure?
At Reworkd, we iterated on all these problems across tens of thousands of real web tasks to build a powerful perception system for web agents: Tarsier!
In the video below, we use Tarsier to provide webpage perception for a minimalistic GPT-4 LangChain web agent.
tarsier.mp4
How does it work?
Tarsier visually tags interactable elements on a page with brackets + an ID, e.g. [23].
In doing this, we provide a mapping between elements and IDs for an LLM to take actions upon (e.g. CLICK [23]).
We define interactable elements as buttons, links, or input fields that are visible on the page; Tarsier can also tag all textual elements if you pass tag_text_elements=True.
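As a rough illustration of how an agent could use that mapping, the sketch below parses an LLM reply such as CLICK [23] and looks up the XPath for the tagged element. The action grammar, the `resolve_action` helper, and the example mapping are all hypothetical and are not part of Tarsier; the only thing assumed from Tarsier is a dictionary from integer tag IDs to XPath strings.

```python
import re
from typing import Dict, Tuple

# Hypothetical action grammar: an upper-case verb followed by a bracketed tag,
# e.g. "CLICK [23]". An optional #, @ or $ prefix inside the brackets is
# accepted and ignored here.
ACTION_PATTERN = re.compile(r"^\s*([A-Z]+)\s+\[\s*[#@$]?\s*(\d+)\s*\]")


def resolve_action(llm_output: str, tag_to_xpath: Dict[int, str]) -> Tuple[str, str]:
    """Return (verb, xpath) for an LLM reply such as 'CLICK [23]'."""
    match = ACTION_PATTERN.match(llm_output)
    if match is None:
        raise ValueError(f"Could not parse an action from: {llm_output!r}")
    verb, tag_id = match.group(1), int(match.group(2))
    if tag_id not in tag_to_xpath:
        raise KeyError(f"Tag [{tag_id}] is not present on the page")
    return verb, tag_to_xpath[tag_id]


# Example mapping, made up for illustration.
example_mapping = {23: "//html/body/div[1]/a[2]"}
print(resolve_action("CLICK [23]", example_mapping))
```

Rejecting unknown IDs up front keeps a hallucinated tag from turning into a click on the wrong element.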
Furthermore, we've developed an OCR algorithm to convert a page screenshot into a whitespace-structured string (almost like ASCII art) that even an LLM without vision can understand.
Since current vision-language models still lack the fine-grained representations needed for web interaction tasks, this is critical.
On our internal benchmarks, unimodal GPT-4 + Tarsier-Text beats GPT-4V + Tarsier-Screenshot by 10-20%!
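To make the idea of a whitespace-structured string concrete, here is a minimal sketch that places OCR words on a character grid according to their pixel coordinates. This is not Tarsier's actual algorithm: the `layout_words` function, the `(text, x, y)` input format, and the cell sizes are assumptions chosen only to show the general technique.

```python
from typing import Dict, List, Tuple

# Each word is (text, x, y) with x and y in pixels. This input format is
# hypothetical; a real OCR service returns richer bounding-box data.
Word = Tuple[str, float, float]


def layout_words(words: List[Word], char_width: float = 10.0,
                 line_height: float = 20.0) -> str:
    """Place words on a character grid so relative positions survive as whitespace."""
    if not words:
        return ""
    rows: Dict[int, List[Tuple[int, str]]] = {}
    for text, x, y in words:
        row, col = int(y // line_height), int(x // char_width)
        rows.setdefault(row, []).append((col, text))
    lines = []
    for row in range(max(rows) + 1):
        line = ""
        for col, text in sorted(rows.get(row, [])):
            if line and col <= len(line):
                line += " "  # words would collide: keep a single space between them
            else:
                line = line.ljust(col)  # pad with spaces up to the target column
            line += text
        lines.append(line)
    return "\n".join(lines)


# Made-up word positions for illustration.
print(layout_words([("Login", 0, 0), ("Search", 200, 0), ("Submit", 50, 40)]))
```

Empty rows are kept as blank lines so that vertical gaps on the page remain visible in the text.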
Tagged Screenshot | Tagged Text Representation |
---|---|
![]() | ![]() |
Installation
Usage
Visit our cookbook for agent examples using Tarsier.
Otherwise, basic Tarsier usage might look like the following:
    import asyncio
    import json

    from playwright.async_api import async_playwright
    from tarsier import Tarsier, GoogleVisionOCRService


    def load_google_cloud_credentials(json_file_path):
        with open(json_file_path) as f:
            credentials = json.load(f)
        return credentials


    async def main():
        # To create the service account key, follow the instructions in this SO answer:
        # https://stackoverflow.com/a/46290808/1780891
        google_cloud_credentials = load_google_cloud_credentials('./google_service_acc_key.json')

        ocr_service = GoogleVisionOCRService(google_cloud_credentials)
        tarsier = Tarsier(ocr_service)

        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=False)
            page = await browser.new_page()
            await page.goto("https://news.ycombinator.com")

            page_text, tag_to_xpath = await tarsier.page_to_text(page)

            print(tag_to_xpath)  # Mapping of tags to XPaths
            print(page_text)  # Text representation of the page


    if __name__ == '__main__':
        asyncio.run(main())
Keep in mind that Tarsier tags different types of elements differently to help your LLM identify which actions can be performed on each element. Specifically:
- [#ID]: text-insertable fields (e.g. textarea, input with textual type)
- [@ID]: hyperlinks (<a> tags)
- [$ID]: other interactable elements (e.g. button, select)
- [ID]: plain text (if you pass tag_text_elements=True)
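One way an agent might use these prefixes is to scan the page text for tags and classify each by its prefix before deciding which actions to offer the LLM. The sketch below is an assumption-laden example: `find_tags`, the kind names in `TAG_KINDS`, and the sample string are made up for illustration and are not part of Tarsier.

```python
import re
from typing import List, Tuple

# Prefix-to-kind table following the tag list in this README.
# The kind names themselves are hypothetical.
TAG_KINDS = {
    "#": "text_input",
    "@": "link",
    "$": "other_interactable",
    "": "plain_text",
}
TAG_PATTERN = re.compile(r"\[\s*([#@$]?)\s*(\d+)\s*\]")


def find_tags(page_text: str) -> List[Tuple[int, str]]:
    """Return (id, kind) for every bracketed tag found in a page text representation."""
    return [
        (int(tag_id), TAG_KINDS[prefix])
        for prefix, tag_id in TAG_PATTERN.findall(page_text)
    ]


# Made-up page text for illustration.
sample = "[@1] Hacker News   [#2] search   [$3] Submit   [4] 30 comments"
print(find_tags(sample))
```

An agent could then, for example, only allow a typing action on IDs classified as text inputs.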
Local Development
Setup
We have provided a handy setup script to get you up and running with Tarsier development.
If you modify any TypeScript files used by Tarsier, you'll need to execute the following command.
This compiles the TypeScript into JavaScript, which can then be used in the Python package.
Testing
We use pytest for testing. To run the tests, simply run:
Linting
Before submitting a PR, please run the following to format your code:
Supported OCR Services
- Google Cloud Vision
- Amazon Textract (Coming Soon)
- Microsoft Azure Computer Vision (Coming Soon)
Roadmap
- Add documentation and examples
- Clean up interfaces and add unit tests
- Launch
- Improve OCR text performance
- Add options to customize tagging styling
- Add support for other browser drivers as necessary
Citations
@misc{reworkd2023tarsier,
title = {Tarsier},
author = {Rohan Pandey and Adam Watkins and Asim Shrestha and Srijan Subedi},
year = {2023},
howpublished = {GitHub},
url = {https://github.com/reworkd/tarsier}
}