For many of us, the first AI tool we encountered was Siri on Apple's iPhone. The voice assistant debuted as a headline feature of the iPhone 4S in 2011, and whether it was helping us answer a call or set an alarm, Siri made life a little easier and was fun to interact with.
But in recent years, there haven't been any major announcements regarding Siri. Now that AI is in the spotlight, especially after the launch of OpenAI's chatbot ChatGPT, reports suggest that Siri may also get smarter in the future. Word that Apple is working on generative AI capabilities for Siri has been circulating for a while. Now a research paper from Apple, published on Cornell University's arXiv preprint server, describes a new multimodal large language model (MLLM) built to understand how a phone's user interface works.
The paper, titled "Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs," explains that while the technology has come a long way, it still has limitations when it comes to interacting with the user interface on screens. Ferret-UI, which builds on Apple's earlier Ferret model, is an MLLM being developed specifically to understand UI screens and how the apps on a phone work.
According to the paper, the MLLM is also designed with "referring, grounding, and reasoning capabilities." One of the primary challenges in improving AI's understanding of app screens lies in the elongated aspect ratios and densely packed small visuals of smartphone displays. Ferret-UI tackles this obstacle by splitting each screen into sub-images and magnifying their details, so the model can make out even the smallest icons and buttons.
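To make that idea concrete, here is a minimal sketch of the sub-image approach, assuming a CLIP-style fixed encoder input size and a simple half-split rule. It illustrates the general technique described above, not Apple's actual Ferret-UI pipeline; the split rule and target size are assumptions.

```python
# Minimal sketch: keep the screenshot's native aspect ratio, split it into
# two halves plus a downscaled global view, and resize each piece separately
# so small icons occupy more encoder pixels than in the global view alone.
from PIL import Image

ENCODER_SIZE = (336, 336)  # typical CLIP-style input size (assumption)

def split_screenshot(path: str) -> list[Image.Image]:
    screen = Image.open(path)
    w, h = screen.size
    # Global view: the whole screen, resized for the encoder.
    views = [screen.resize(ENCODER_SIZE)]
    if h >= w:
        # Portrait phone screen: split into top and bottom halves.
        halves = [screen.crop((0, 0, w, h // 2)),
                  screen.crop((0, h // 2, w, h))]
    else:
        # Landscape screen: split into left and right halves.
        halves = [screen.crop((0, 0, w // 2, h)),
                  screen.crop((w // 2, 0, w, h))]
    # Each half is resized on its own, magnifying its details.
    views += [half.resize(ENCODER_SIZE) for half in halves]
    return views
```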
The paper also reports that, through careful training, Ferret-UI has outperformed existing models at understanding and interacting with app interfaces. If Ferret-UI is eventually incorporated into Apple's voice assistant Siri, it could make the tool considerably smarter. In the future, the digital assistant could perform complex tasks within apps. Imagine instructing Siri to book a flight or make a reservation and having it seamlessly interact with the corresponding app to fulfill the request.
Ferret itself is an open-source multimodal large language model developed jointly by Apple and Cornell University, the product of extensive research into how large language models can recognize and understand elements within images. In practice, that means Ferret can handle image-based queries much the way users query ChatGPT or Gemini, while also referring to specific regions within an image. Ferret was released for research purposes last October.
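For a sense of what such region-aware queries look like, here is a hypothetical sketch. The `ferret_generate` function, the bracketed-box prompt format, and the canned reply are all invented for illustration; this is not Ferret's real API, only the rough shape of the interaction.

```python
# Hypothetical interface invented for this sketch; Ferret's real inference
# code differs, but the interaction pattern is similar.
def ferret_generate(image_path: str, prompt: str) -> str:
    # Stand-in for real model inference; returns one canned reply
    # regardless of the prompt so the example runs end to end.
    return "That region is the Phone icon; tapping it opens the dialer."

# "Referring": ask about a specific region, given here as [x1, y1, x2, y2].
print(ferret_generate(
    "home_screen.png",
    "What does the element in region [24, 880, 120, 976] do?",
))

# "Grounding": ask the model to locate an element on the screen.
print(ferret_generate(
    "home_screen.png",
    "Where is the button that opens the Camera app?",
))
```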