Foundational model (LLM) for visual computer desktop automation and understanding - for building AutoGPT agent #6948
alexandre-emmanuel
started this conversation in
General
Replies: 1 comment
-
I think there are models for it, but I don't know if any of them are publicly available. Maybe OpenInterpreter? I think this would be really cool to integrate into AutoGPT, though.
-
When I try to come up with ideas for an AutoGPT agent, I conclude that it should have computer screen/desktop understanding and manipulation capabilities. Specific tool usage (e.g. outputting a script to a file and running it from the command line) is not enough, because most tools are intended to be used visually.
My question is: is there a foundational model for computer desktop understanding and, possibly, manipulation?
How do I imagine it? E.g. an AutoGPT action could capture the current screen/desktop as a picture and ask this foundational model to segment it and to output textual content for each segment. This textual content could then be sent to another LLM which (together with the context, e.g. the history of previous steps and previous conversations) is able to determine which action to execute. E.g. there can be a segment with an "Open Options menu" button and (some steps later) a segment with a "Compile" button, and AutoGPT can then decide to position the cursor over it and click.
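To make that loop concrete, here is a minimal Python sketch of the capture → segment → decide → click cycle. `segment_screen` and `choose_action` are hypothetical placeholders for the visual foundation model and the planning LLM (they are not part of AutoGPT or any existing library); only the pyautogui calls are real APIs:

```python
import pyautogui  # pip install pyautogui

def segment_screen(image):
    """Hypothetical stand-in for the visual foundation model: should return
    a list of {'label': str, 'bbox': (left, top, width, height)} dicts,
    one per detected UI element (buttons, menus, text fields, ...)."""
    raise NotImplementedError("plug in your vision model here")

def choose_action(segments, history):
    """Hypothetical stand-in for the planning LLM: given the textual segment
    descriptions and the history of previous steps, returns the label of the
    element to click next."""
    raise NotImplementedError("plug in your planning LLM here")

def step(history):
    screenshot = pyautogui.screenshot()    # capture the current desktop
    segments = segment_screen(screenshot)  # segment it into labeled elements
    target_label = choose_action(segments, history)
    target = next(s for s in segments if s["label"] == target_label)
    left, top, width, height = target["bbox"]
    pyautogui.moveTo(left + width / 2, top + height / 2)  # center of element
    pyautogui.click()
    history.append(f"clicked {target_label}")
```

The appealing part of this split is that the vision model only has to emit labeled bounding boxes; everything after that is ordinary text-in/text-out LLM planning.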
Or something similar could be done by drag-and-dropping a component from the tool palette onto the design surface of some IDE (be it a software programming IDE or an architectural IDE like AutoCAD).
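For the drag-and-drop case only the acting step changes; assuming the same (hypothetical) bounding-box output from `segment_screen` above, a pyautogui sketch could look like this:

```python
import pyautogui

def drag_component(source_bbox, target_bbox):
    """Drag a palette component onto a design surface. Both arguments are
    (left, top, width, height) boxes, e.g. as produced by the hypothetical
    segment_screen() above."""
    sx = source_bbox[0] + source_bbox[2] / 2  # center of the palette item
    sy = source_bbox[1] + source_bbox[3] / 2
    tx = target_bbox[0] + target_bbox[2] / 2  # center of the drop target
    ty = target_bbox[1] + target_bbox[3] / 2
    pyautogui.moveTo(sx, sy)
    pyautogui.dragTo(tx, ty, duration=0.5, button="left")  # press, drag, release
```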
So, are there any such multimodal LLMs? It could be a visual LLM that just understands the desktop and describes it, with more detailed understanding of the desktop then done by another (textual) LLM. Or could it be a single visual-textual LLM that both segments and understands the desktop?
I have heard about the Ferret model from Apple (https://technewsbynovy.medium.com/apple-discreetly-released-an-open-source-multimodal-llm-in-october-681cda4a65d9), but I have not tried it.
I just wanted to check with the community about its thinking, personal discoveries, and knowledge in this direction.
I guess GPT-4 can process images with a lot of sophistication, but I prefer open-source models, because I plan to train and improve them as well.
Thx.