At the Consumer Electronics Show, there was a lot of buzz around the Rabbit R1, an AI gadget that acts like a personal assistant. Rabbit's CEO, Jesse Lyu, showed it off in a way that reminded people of famous tech leaders. But it's not just the gadget that's interesting, it's the way people are thinking about AI.
People are getting really excited about AI products like the Rabbit R1 and Microsoft's Copilot. These technologies are doing more than just simple tasks; they are starting to make semi-autonomous decisions for us.
Big names in tech, like Marc Andreessen, are saying AI can change the world for the better. But the more you look at it, the more it seems like some people are treating AI like it's more than just technology.
It's almost like they see it as a powerful force that can solve all our problems.
This raises a big question: Are we treating AI like it's a kind of cult?
It's important to think about this as AI becomes a bigger part of our lives.
🗝️ Quick Bytes:
Meta releases free ‘Code Llama 70B’ to challenge GPT-4 in AI coding race
On Monday, Meta announced the release of Code Llama 70B, a free, open-source AI model for code generation with 70 billion parameters. Code Llama 70B is one of the largest models available for generating programming code and is designed to create longer and more complex code than previous Meta models.
In benchmarks, Code Llama 70B achieved 53% accuracy on the HumanEval test, approaching the 67% accuracy reported for OpenAI's GPT-4 model, showing Meta is closing the gap with GPT-4 for AI-assisted coding. Meta CEO Mark Zuckerberg emphasized Code Llama 70B's superior performance over models like GPT-3.5 and the importance of democratizing access to powerful AI coding tools.
The release of the free Code Llama 70B has sparked discussion about the potential impact of such models on software development.
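If you want to try it yourself, here is a minimal sketch of how the model could be loaded through the Hugging Face transformers library. The checkpoint name and prompt below are my assumptions for illustration, and the full 70B weights need serious GPU memory or quantization:

```python
# Minimal sketch (not official Meta example code): generating code with
# Code Llama 70B via Hugging Face transformers.
# Assumptions: the checkpoint id "codellama/CodeLlama-70b-hf" and the prompt
# are illustrative; the 70B weights need substantial GPU memory or quantization.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codellama/CodeLlama-70b-hf"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = 'def fibonacci(n):\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```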
Anthropic confirms it suffered a data leak
AI startup Anthropic recently suffered a data leak when a contractor mistakenly sent a file containing customer names and accounts receivable data to an unauthorized third party.
Anthropic stated that the January 22nd incident was caused by human error rather than a breach of its systems and that no sensitive personal data like banking information was exposed. Though Anthropic does not believe there has been any malicious activity resulting from the disclosure, it advised affected customers to be vigilant against potential phishing attempts using the leaked data.
The company apologized for the incident and said its team is available to provide support.
TikTok owner ByteDance's chief warns against mediocrity as AI disrupts
ByteDance CEO Liang Rubo warned employees that the company risks becoming complacent and mediocre as it faces growing competition from startups. He criticized ByteDance for being late to adopt new AI technologies like GPT models, while younger startups are quicker to spot and adopt cutting-edge innovations. Liang said ByteDance has become less efficient as it expanded, with excessive bureaucracy now causing 6-month delays on projects a startup could do in 1 month.
In response, ByteDance is increasing its focus on AI, including testing several chatbots, although its AI strategy recently came under scrutiny over the use of OpenAI's technology.
🎛️ Algorithm Command Line
It looks like ChatGPT knows some strange things about the internet. 🤓
💡Explained
Evaluating Multi-Modal Large Language Models
How do you evaluate multiple modalities? In a recent paper of over 300 pages (From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities), researchers broadly evaluated Multi-modal Large Language Models (MLLMs) on the Text, Code, Image, and Video modalities. The paper evaluates closed-source models like GPT-4 and Gemini, along with six open-source models, across 230 case studies in four modalities, focusing on reliability. It aims to understand their capabilities and limitations for practical applications.
In this Explained section we will focus only on summarizing the text modality part, as it's the most relevant to the topics I usually explain. The text modality was divided into three main categories, each split into subcategories covering different aspects and metrics.
🤖 Generalization
In machine learning, generalization means a model's ability to produce good-quality outputs on previously unseen data. For LLMs, generalization broadly means a model's ability to understand and generate text, which is a crucial aspect of measuring their overall capabilities.
Categories: The paper evaluated six main categories: Mathematics (analysis, numerical understanding, and problem solving), Multilinguality, Reasoning (how efficiently one can reach solutions or conclusions from the evidence at hand), Role-playing, Creative writing, and Domain Knowledge in fields like medicine or economics.
Data: The authors noted the problem of data leakage: existing test datasets are likely to be included in the models' training data, which makes fair comparison impossible. Hence, they invited experts to manually construct a test set of 44 challenging test cases.
Results: The clear winner in this category was GPT-4 with 83.33%, with Gemini Pro in second place at 59.05%.
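As a side note, the headline percentages in this section are simply the share of hand-built test cases a model passes. Here is a tiny sketch of that bookkeeping; this is my own simplification for illustration, not the authors' released evaluation code, and the cases below are invented:

```python
# Minimal sketch: per-category accuracy over expert-judged test cases.
# This is a simplification for illustration, not the paper's released code;
# the categories and pass/fail judgments below are invented.
from collections import defaultdict

# Each entry: (category, did_the_model_pass_this_case)
judged_cases = [
    ("Mathematics", True),
    ("Mathematics", False),
    ("Reasoning", True),
    ("Multilinguality", True),
    ("Creative writing", False),
]

totals, passed = defaultdict(int), defaultdict(int)
for category, ok in judged_cases:
    totals[category] += 1
    passed[category] += int(ok)

for category, n in totals.items():
    print(f"{category}: {100.0 * passed[category] / n:.2f}%")

overall = 100.0 * sum(passed.values()) / sum(totals.values())
print(f"Overall: {overall:.2f}%")  # the kind of per-model headline number reported
```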
🤝 Trustworthiness
It refers to the model's ability to produce safe, accurate, robust, moral, legally compliant, fair, and privacy-protecting content.
Categories: They evaluated safety, which covers toxicity and extreme risks in LLMs' output such as hate speech, pornography, or violent content, along with the degree of hallucination, robustness, morality, and fairness. Finally, they included data protection and whether the model generates suggestions that break the law, such as advising theft.
Data: They used existing trustworthiness evaluation frameworks.
Results: In this category, Llama 2 won with 95.24% (GPT-4 scored 80.95%).
🎭 Causality
It refers to the ability to understand and generate content that accurately reflects cause-and-effect relationships.
Categories: This includes assessing LLMs' proficiency in identifying and calculating statistical correlations, their capacity for simulating changes or interventions in real-world scenarios, and their reasoning about hypothetical alternatives to actual events. Additionally, it involves investigating their ability to uncover causal links between events, compute causal effects, and maintain accuracy when the prompt changes. The evaluation also tests LLMs' causal hallucination and their adherence to instructions in varied causal scenarios.
Data: Open datasets like CLadder, e-CARE, and others.
Results: GPT-4 also won this category with 82.22%, with Mixtral in second place at 44.44%.
Conclusions
The authors conducted a very comprehensive survey across multiple modalities, provided examples for each, and shared the list of datasets used. Moreover, they shared the code. A big advantage of the study is the care taken to openly detail how the models were evaluated and which categories were included.
I personally find the open-source code for model evaluation very useful. Model development is one thing, but model evaluation is always very challenging. However, the length… it gives me a headache. It is difficult, if not impossible, to make such a comprehensive paper significantly shorter; it could easily become a book on how to evaluate MLLMs.
🗞️ Longreads
The Cult of AI. How one writer's trip to an annual tech conference left him with a sinking feeling about the future (read)