🗝️ Quick Bytes:
The AI wars heat up with Claude 3, claimed to have “near-human” abilities
Anthropic has launched Claude 3, a new AI language model series comprising Claude 3 Haiku, Sonnet, and Opus. These models demonstrate enhanced cognitive abilities, approaching "near-human" levels in some tasks, and are available through various subscription and API access options. Opus, the most advanced model, requires a $20 monthly subscription.
Claude 3 models show improvements over previous versions in areas like analysis, content creation, and multilingual conversations. They also feature enhanced vision capabilities, processing visual formats like photos and charts. The models vary in pricing, with Opus being the most expensive at $15 per million input tokens and $75 per million output tokens.
The development of Claude 3 included the use of synthetic data for training, resulting in significant capability gains. Anthropic plans frequent updates to the models, focusing on features like interactive coding and advanced capabilities, while maintaining a commitment to safety and minimizing catastrophic risk.
OpenAI says Elon Musk wanted ‘absolute control’ of the company
OpenAI intends to dismiss Elon Musk's lawsuit, countering his claim that the company deviated from its non-profit mission. OpenAI's response highlights Musk's previous desire for significant control over OpenAI, including merging it with Tesla, gaining majority equity, board control, and CEO position.
Musk's lawsuit alleges OpenAI, a company he initially helped fund, has become overly commercialized and acts as a "closed-source de facto subsidiary" of Microsoft. OpenAI disputes this, maintaining that Musk was informed and agreeable to the company becoming less open as it advanced towards artificial general intelligence.
OpenAI's blog post refutes Musk's assertions about the proprietary nature of GPT-4 and its alignment with Microsoft. The company also shared an email exchange from 2016, revealing a mutual understanding between Musk and OpenAI executives about limiting open-source practices as AGI development progresses.
Tumblr’s owner is striking deals with OpenAI and Midjourney for training data, says report
Automattic, the owner of Tumblr and WordPress.com, is reportedly in negotiations with AI firms Midjourney and OpenAI to provide user post data for AI training. An anonymous source claims these deals are close to completion, following recent speculation on Tumblr about a potential revenue-generating partnership with Midjourney.
Automattic is said to be introducing a new user setting to opt out of data sharing with third parties, including AI companies. However, concerns have been raised about a previously made "initial data dump" from Tumblr, which allegedly included all public posts from 2014 to 2023 and possibly some non-public content.
While Automattic has not confirmed specifics, they have acknowledged working with AI companies under conditions aligning with community interests in attribution and control. This approach reflects a broader trend where companies balance user privacy with the adoption of AI technologies, a strategy that has led to mixed reactions within the creative community.
🎛️ Algorithm Command Line
I'm almost 100% sure you've been in a situation where you shared your prompt with another person and got a reply like:
→ “What should I do now?”
→ “Which brackets should I fill with my data?”
→ “Should I divide it or do something else?”
etc.
It's completely normal. I've encountered this situation many times, especially at my AI workshops and when I share prompts designed for my clients. Sometimes these instructions are longer than the average children's story.
Yes, I know the face you'll make when you see them. But… you can reduce this friction with one small trick.
It lets you just share a link: when the other person pastes it into their browser and hits Enter, the interaction starts immediately.
Without overthinking.
Here’s how it works:
Depending on which version of ChatGPT you're using, the starting part of the web address will be different, but the method is the same.
→ GPT-4 - ?model=gpt-4&q=your+prompt+pasted+here
→ custom GPT - ?q=your+prompt+pasted+here
At the end of the URL, after the "q=", type your question, topic, or prompt. Replace the "your+prompt+pasted+here" part with your actual prompt, using + signs instead of spaces.
Like I do in the video below.
I bet with this trick you could make a prompt even your grandma could understand.
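By the way, if you'd rather build these links programmatically (say, to prepare a whole batch of prompts before a workshop), here's a minimal Python sketch. The chat.openai.com base address and the helper name are my assumptions, not an official API, so swap in whatever address your ChatGPT version uses.

```python
from urllib.parse import quote_plus

def chatgpt_link(prompt, model="gpt-4"):
    """Build a shareable ChatGPT URL that pre-fills the prompt.

    Assumes the ?model= / ?q= query parameters described above;
    adjust the base URL if your ChatGPT version uses a different one.
    """
    base = "https://chat.openai.com/"
    query = "q=" + quote_plus(prompt)       # spaces become + signs automatically
    if model:
        query = f"model={model}&" + query   # omit model for a custom GPT link
    return f"{base}?{query}"

print(chatgpt_link("Summarize this article in 5 bullet points"))
# https://chat.openai.com/?model=gpt-4&q=Summarize+this+article+in+5+bullet+points
```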
💡Explained
This week’s Explained section is carried out by Pano Evangeliou, a Senior AI Engineer at Chaptr, specializing in LLMs. He has prepared a long-form article for you, explaining the recent paper „…" and sharing his thoughts and experience.
Dive into the text and be sure to connect with Pano on LinkedIn!
Can a foundation model meaningfully interact with humans and generalize across different environments? An embodied multi-modal agent framework.
The pursuit of AGI
The Interactive Agent Foundation Model (IAFM) is Microsoft's latest audacious push toward the holy grail of AI aspirations: Artificial General Intelligence (AGI). The proposed framework is a vision of a future where AI can mimic human cognition, learning, and interaction seamlessly across various domains. The genius of the IAFM lies in its ability to navigate through, learn from, and adapt to a variety of environments, from the virtual landscapes of video games to the tangible realities of our physical world, implying potential revolutions in sectors as diverse as healthcare, education, and even robotics.
Can a foundation model generalize? Grounding and multi-modalities
A common issue with foundation models is that they often struggle to generalize to new tasks, underperforming or producing inaccuracies, a problem attributed to their reliance on vast internet datasets and a lack of real-world grounding. The issue is compounded when these models, which are not fine-tuned across different modalities, are used as frozen components in multimodal systems.
The Interactive Agent Foundation Model (IAFM) offers a solution by introducing a unified pre-training framework that processes text, visuals, and actions, treating each input as distinct tokens to be predicted across modalities. This method not only addresses the grounding issue, reducing inaccuracies, but also facilitates the integration of information across different modalities. Moreover, training a single neural model across many tasks and modalities significantly improves scalability. All of the above inches further away from the task-specific AI of today, toward the AGI of tomorrow.
How to make the model interactive? Transitioning to agent-based systems
IAFM is designed as a dynamic, agent-based system, setting it apart from traditional, static AI models. This shift makes the agent an active participant in its environment, meaningfully interacting with its surroundings and adapting its behavior to unknown and diverse situations in real time. Similar to a human, yet at a much higher capacity, it learns from a diverse mix of sources, combining words, visuals, and actions, to develop a deep, multifaceted understanding of the world, whether it's mastering the intricate details of robotics, captivating players in virtual game worlds, or providing custom assistance in healthcare settings. A key difference between the IAFM approach and existing interactive agents is that the agent's actions directly impact task planning, as the agent does not need to receive feedback from the environment to plan its next actions.
Is embodiment possible? Operationalizing agents
As the authors highlight, AI is at a “pivotal historical juncture” where agent technology is shifting from simulation to real-world applications. Imagine an autonomous robot assistant that can directly communicate with non-expert humans, adapt to the environment, and seamlessly execute useful tasks. To achieve this, the researchers propose the Embodied Agent Paradigm. This paradigm views embodied agents as members of a collaborative system, where they interact with humans and their environment using vision-language capabilities and employ a vast set of actions based on human needs, thereby mitigating cumbersome tasks in both virtual reality and the physical world.
Structured around five principal modules (Agent in Environment and Perception, Agent Learning, Memory, Action, and Cognition and Consciousness), the embodied agents navigate obstacles, manipulate objects, and engage in complex interactions, showcasing the tangible impact of simulation-based training. On the other hand, there are still many challenges to achieving an embodied AI agent, mainly around understanding the complex dynamics of multi-modal systems in the physical world.
How did they do it? A unified pre-training framework
Tokenization: The researchers propose a unified tokenization framework as a general pre-training strategy for predicting input tokens. For text tokens, the standard language modeling task of next-token prediction is used. For actions, the vocabulary of the language model is expanded to include special “agent” tokens that represent each of the actions available to the agent. Finally, visual tokens are incorporated into the framework by training a visual encoder to predict masked visual tokens.
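To make the unified tokenization a bit more concrete, here is a minimal sketch of how special “agent” action tokens could be bolted onto a standard language-model tokenizer. The token names, the GPT-2 backbone, and the Hugging Face calls are illustrative assumptions on my side, not the paper's actual implementation, and the visual-token branch is left out.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Hypothetical action vocabulary: one special token per discrete agent action.
ACTION_TOKENS = ["<act_move_forward>", "<act_turn_left>", "<act_grasp>", "<act_release>"]

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Extend the vocabulary and resize the embedding matrix to match.
tokenizer.add_special_tokens({"additional_special_tokens": ACTION_TOKENS})
model.resize_token_embeddings(len(tokenizer))

# Training stays plain next-token prediction: action tokens are predicted
# exactly like text tokens, interleaved in the same sequence.
example = "Instruction: pick up the red block. <act_move_forward> <act_grasp>"
ids = tokenizer(example, return_tensors="pt").input_ids
loss = model(ids, labels=ids).loss  # causal LM loss over text and action tokens
print(float(loss))
```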
Model architecture: The model is designed to process multi-modal information using five main components. First, a visual encoder takes visual input, e.g. video, and encodes it into visual tokens. Then, a language model handles textual instructions and generation. Between those two components sits an additional linear layer that transforms the visual tokens into the token embedding space of the language transformer. The fourth component is the action encoder, which uses cognition and reinforcement learning to anticipate the most suitable actions based on current states and inputs. Finally, the learning & memory component enables the agent to accumulate experiences, recall past interactions, and leverage this knowledge in future tasks.
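To make that layout easier to picture, here is a toy PyTorch sketch of how the five components might be wired together. Every choice here (module types, sizes, the plain linear action head standing in for the paper's action encoder, and the learned memory slots) is an illustrative assumption, not the authors' code.

```python
import torch
import torch.nn as nn

class InteractiveAgentSketch(nn.Module):
    """Toy wiring of the five components described above; all details are assumptions."""

    def __init__(self, vocab_size=32000, d_vis=768, d_lm=1024, n_actions=64, mem_slots=16):
        super().__init__()
        # 1. Visual encoder: turns video-frame features into visual tokens.
        self.visual_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_vis, nhead=8, batch_first=True), num_layers=2)
        # 2. Linear layer: projects visual tokens into the LM embedding space.
        self.vis_to_lm = nn.Linear(d_vis, d_lm)
        # 3. Language model: handles textual instructions and generation.
        self.text_embed = nn.Embedding(vocab_size, d_lm)
        self.language_model = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=d_lm, nhead=8, batch_first=True), num_layers=4)
        # 4. Action head: a simplified stand-in for the action encoder, scoring candidate actions.
        self.action_head = nn.Linear(d_lm, n_actions)
        # 5. Learning & memory: learned slots the agent can attend to across tasks.
        self.memory = nn.Parameter(torch.randn(mem_slots, d_lm) * 0.02)

    def forward(self, text_ids, visual_feats):
        vis_tokens = self.vis_to_lm(self.visual_encoder(visual_feats))      # (B, T_vis, d_lm)
        mem = self.memory.unsqueeze(0).expand(text_ids.size(0), -1, -1)     # (B, M, d_lm)
        context = torch.cat([mem, vis_tokens], dim=1)                       # memory + vision context
        hidden = self.language_model(self.text_embed(text_ids), context)    # text attends to both
        return self.action_head(hidden[:, -1])                              # logits for the next action
```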
Input sequence: Thus, given a text prompt and a single video frame, we can obtain a text or action token prediction. This input sequence is extended, within a sliding window, with the history of previous instructions, videos, and actions. All model components are jointly trained, unlike previous visual-language models, which largely rely on frozen submodules and learn only an adaptation network for cross-modal alignment.
You may have noticed that RLHF is not part of the proposed training. That is because reinforcement learning is already effectively built into the model architecture and training: the input sequence itself includes the environment state and past actions, and the model learns by interacting with its environment.
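Purely as an assumed illustration of what that sliding-window serialization could look like (the paper's exact format will differ), here is a tiny sketch of a history buffer being flattened into a single input sequence:

```python
from collections import deque

WINDOW = 4  # assumed sliding-window length, in interaction steps
history = deque(maxlen=WINDOW)  # the oldest step is dropped automatically

def add_step(instruction, frame_tokens, action_token):
    """Record one interaction step: text instruction, visual tokens, chosen action."""
    history.append((instruction, frame_tokens, action_token))

def build_input_sequence(current_instruction, current_frame_tokens):
    """Flatten the windowed history plus the current step into one token sequence,
    from which the model predicts the next text or action token."""
    seq = []
    for instr, frame, action in history:
        seq += [instr] + frame + [action]
    return seq + [current_instruction] + current_frame_tokens

# Example with placeholder tokens
add_step("pick up the cup", ["<vis_12>", "<vis_87>"], "<act_grasp>")
add_step("place it on the table", ["<vis_43>", "<vis_9>"], "<act_release>")
print(build_input_sequence("wave goodbye", ["<vis_5>", "<vis_77>"]))
```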
📖 Domain experimentation: Gaming, Robotics, and Healthcare
To evaluate its effectiveness as a general-purpose tool, IAFM is tested in three major agent-AI scenarios: robotics, gaming, and healthcare. Initially, the model is trained on pre-training splits from all datasets. Then, for each task, the model is separately fine-tuned and evaluated against the corresponding dataset splits.
Robotics: human-machine manipulation in the physical world
Two existing datasets (Language-table, CALVIN) are selected, consisting of language instructions, video frames, and actions. Custom action tokens are created. The fine-tuned model is evaluated on action prediction guided by language instructions for manipulation tasks. While the pre-trained model performs better than when trained from scratch, it is outperformed by other models, probably due to limited initial pre-training.
Gaming: human-machine embodiment in virtual worlds
The gaming datasets consist of Minecraft and Bleeding Edge demonstrations where the video gameplay is synchronized with player actions and inventory metadata. GPT-4V is used to label videos with natural language instructions. Custom action tokens are created for discrete button and mouse actions. The fine-tuned model is evaluated for its action prediction ability and its image reconstruction quality. The results show that fine-tuning a diverse pre-trained model is significantly more effective than training from scratch.
Healthcare: augmented human-machine interaction in traditional multimodal tasks
For these tasks, the main dataset consisted of hospital ICU-recorded scenes annotated by trained nurses. GPT-4 is used to generate a synthetic video question-answer dataset, avoiding the use of confidential patient data. The dataset also contains nursing and clinical documentation. The fine-tuned model is evaluated for the tasks of video captioning, video question-answering, and action recognition. The results show that, compared to traditional cross-modal baseline models, jointly pre-training on robotics and gaming data improves the performance for action recognition, but does not improve text generation abilities (increased perplexity).
The most important findings from this experimentation
Pre-training the model across all different tasks boosts performance for action prediction across all gaming and robotics datasets. This highlights the importance of a diverse pre-training mixture.
The effectiveness of the model increases when jointly pre-training already pre-trained vision-language models compared to training from scratch.
So, you might be wondering whether you can become a better doctor by playing a lot of Minecraft. Of course not. But these models can! In other words, to build a generalist agent foundation model, it is important to train it with a lot of different data and across diverse tasks and skills.
🤖 Synergies
The Interactive Agent Foundation Model draws inspiration from foundational advancements in AI, specifically in foundation models, multimodal understanding, and agent-based AI. The authors synthesized lessons from large-scale pre-training in language and vision (e.g., GPT series, Alpaca, Flamingo, BLIP), and integrated insights both from efficient vision-language models and from agent-based AI research that leverages reinforcement learning. By combining these diverse strands of AI, they aim to address the limitations of previous models by enhancing grounding and cross-modal learning, leading to adaptable and interactive AI agents.
🌊 The new AI wave
The paper introduces a transformative AI approach, envisioning agents as versatile partners that learn, adapt, and collaborate with humans across various fields. The Interactive Agent Foundation Model is a leap toward AI that provides practical assistance, utilizing advanced perception, planning, and interaction capabilities. This model is poised to revolutionize sectors by improving decision-making and task execution, offering innovations that start in robotics, healthcare, and gaming and extend to myriad other domains, from financial markets and supply chains to data analysis and surgical assistance. The emergence of AGI-driven models like IAFM signals a pivotal shift in our interaction with technology.
“We believe in a future where every machine that moves will be autonomous…We are building the Foundation Agent - a generally capable AI that learns to act skillfully in many worlds, virtual and real. 2024 is the year of Robotics, the Year of Gaming AI, and the Year of Simulation” Jim Fan, NVIDIA, GEAR group.
🗞️ Longreads
Large language models can do jaw-dropping things. But nobody knows exactly why. And that's a problem. Figuring it out is one of the biggest scientific puzzles of our time and a crucial step towards controlling more powerful future models. (read)