Foundational model (LLM) for visual computer desktop automation and understanding - for building AutoGPT agent #6948
alexandre-emmanuel
started this conversation in
General
Replies: 1 comment
-
I think there are models for it, but I don't know if any of them are publicly available. Maybe OpenInterpreter? I think this would be really cool to integrate into AutoGPT, though.
-
When I try to come up with ideas for an AutoGPT agent, I conclude that it should have computer screen/desktop understanding and manipulation capabilities. Specific tool usage (e.g. outputting a script to a file and running it from the command line) is not enough, because most tools are intended to be used visually.
My question is: is there a foundational model for computer desktop understanding and, possibly, manipulation?
How do I imagine it? E.g. an AutoGPT action could capture the current screen/desktop as a picture and ask this foundational model to segment it and to output textual content for each segment. This textual content could then be sent to another LLM which (together with the context, e.g. the history of previous steps and previous conversations) is able to determine which action to execute. E.g. there can be a segment with an "Open Options menu" button and (some steps later) a segment with a "Compile" button, and AutoGPT can then decide to position the cursor over it and click.
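To make that loop concrete, here is a minimal Python sketch of the capture → segment → decide → click cycle. `segment_screen` and `choose_action` are hypothetical placeholders for the visual foundation model and the planning LLM (they are not part of AutoGPT or any existing library); only the pyautogui calls are real APIs:

```python
import pyautogui  # pip install pyautogui

def segment_screen(image):
    """Hypothetical stand-in for the visual foundation model: should return
    a list of {'label': str, 'bbox': (left, top, width, height)} dicts,
    one per detected UI element (buttons, menus, text fields, ...)."""
    raise NotImplementedError("plug in your vision model here")

def choose_action(segments, history):
    """Hypothetical stand-in for the planning LLM: given the textual segment
    descriptions and the history of previous steps, returns the label of the
    element to click next."""
    raise NotImplementedError("plug in your planning LLM here")

def step(history):
    screenshot = pyautogui.screenshot()    # capture the current desktop
    segments = segment_screen(screenshot)  # segment it into labeled elements
    target_label = choose_action(segments, history)
    target = next(s for s in segments if s["label"] == target_label)
    left, top, width, height = target["bbox"]
    pyautogui.moveTo(left + width / 2, top + height / 2)  # center of element
    pyautogui.click()
    history.append(f"clicked {target_label}")
```

The appealing part of this split is that the vision model only has to emit labeled bounding boxes; everything after that is ordinary text-in/text-out LLM planning.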
Or something similar could be done by drag-and-dropping a component from the tool palette onto the design surface of some IDE (be it a software programming IDE or an architectural IDE like AutoCAD).
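For the drag-and-drop case only the acting step changes; assuming the same (hypothetical) bounding-box output from `segment_screen` above, a pyautogui sketch could look like this:

```python
import pyautogui

def drag_component(source_bbox, target_bbox):
    """Drag a palette component onto a design surface. Both arguments are
    (left, top, width, height) boxes, e.g. as produced by the hypothetical
    segment_screen() above."""
    sx = source_bbox[0] + source_bbox[2] / 2  # center of the palette item
    sy = source_bbox[1] + source_bbox[3] / 2
    tx = target_bbox[0] + target_bbox[2] / 2  # center of the drop target
    ty = target_bbox[1] + target_bbox[3] / 2
    pyautogui.moveTo(sx, sy)
    pyautogui.dragTo(tx, ty, duration=0.5, button="left")  # press, drag, release
```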
So, are there any such multimodal LLMs? It could be a visual LLM that just understands the desktop and describes it, with more detailed understanding of the desktop then done by another (textual) LLM. Or could it be a single visual-textual LLM that both segments and understands the desktop?
I have heard about the Ferret model from Apple (https://technewsbynovy.medium.com/apple-discreetly-released-an-open-source-multimodal-llm-in-october-681cda4a65d9), but I have not tried it.
I just wanted to check with the community about its thinking, personal discoveries, and knowledge in this direction.
I guess GPT-4 can process images with a lot of sophistication, but I prefer open-source models, because I plan to train and improve them as well.
Thx.