🗝️ Quick Bytes:
Apple and Google are discussing a deal to bring generative A.I. to iPhones
Apple is in discussions with Google to use Google's Gemini generative AI model for the next iPhone. This would extend the long-standing partnership between the two companies, with Google previously providing services like Maps and being the default search engine on iPhones.
The potential deal could give Google access to Apple's massive user base for its AI capabilities. However, it may also face regulatory scrutiny due to the existing antitrust lawsuit against Google over its search agreements with Apple. The talks are still preliminary, and Apple has also discussed AI partnerships with other companies like OpenAI.
Nvidia reveals Blackwell B200 GPU, the ‘world’s most powerful chip’ for AI
Nvidia has announced its new Blackwell B200 GPU and GB200 "superchip" for AI workloads, offering up to 20 petaflops of FP4 performance and claiming 30x better performance for large language model inference compared to the previous H100 chip.
The GB200 combines two B200 GPUs with a Grace CPU and can support up to 27 trillion parameter models when multiple GB200 systems are interconnected using Nvidia's new NVLink technology. Major cloud providers like Amazon, Google, Microsoft, and Oracle plan to offer Blackwell-based systems, while Nvidia is also offering solutions like the liquid-cooled GB200 NVL72 rack with 72 GPUs.
Nvidia claims the new architecture reduces cost and energy consumption by up to 25x over H100 for large language models. The Blackwell GPU architecture will likely power Nvidia's upcoming RTX 50-series gaming GPUs as well.
Sam Altman hints at the future of AI and GPT-5 - and big things are coming
OpenAI is expected to release GPT-5, the next iteration of its large language model that powers ChatGPT, as early as this summer. Some enterprise customers who have seen demos claim GPT-5 is "materially better" and a significant improvement over GPT-4.
In a recent interview, OpenAI CEO Sam Altman hyped up GPT-5, stating the "delta between 5 and 4 will be the same as between 4 and 3" and that it will be "smarter," "faster," with improved multimodal capabilities. However, Altman did not provide a specific release date or confirm if it will actually be called "GPT-5." GPT-5 is still undergoing training and safety testing, which could delay the launch timeline. While details are vague, OpenAI appears poised to release a major AI upgrade in the coming months.
🎛️ Algorithm Command Line
42 daily factors affect my diabetes.
LLMs help me manage them.
Last year in June, I spent my birthday lying in bed in the intensive care unit. I passed out in my apartment and almost died on the floor. I was lucky that an ambulance arrived in less than 8 minutes.
I didn’t know then that my birthday gift would be type 1 diabetes, even though I had over 10 years of professional sports experience and had always been active.
Sh** happens, right?
Managing diabetes is pure math and data. Every day, 42 factors affect my blood sugar. You absolutely can’t manage all of these things at once.
But I noticed that three things make a significant difference in my results.
Sleep, diet, and physical activity.
So, almost instantly, I came up with the idea of creating a custom GPT that can draw conclusions about my management from PDF and CSV files. As data sources, I use a CGM (continuous glucose monitor) on my arm, which measures my blood glucose levels 24/7, and a Garmin sports watch that I also wear 24 hours a day.
The logic behind this is simple.
CGM + Garmin + AI = better daily choices = better health.
Good glucose control impacts all areas of my life. Most important for me are my creativity and focus.
(P.S. I get the privacy concerns, but this data is also incredibly valuable for my medical team.)
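For the curious, here is a minimal sketch of how the two CSV exports could be combined into a daily summary before handing it to a custom GPT. The file names, column names, and hourly resampling are illustrative assumptions, not my exact setup.

```python
import pandas as pd

# Assumed CSV exports; real CGM/Garmin column names will differ.
cgm = pd.read_csv("cgm_export.csv", parse_dates=["timestamp"])        # timestamp, glucose_mg_dl
garmin = pd.read_csv("garmin_export.csv", parse_dates=["timestamp"])  # timestamp, steps, sleep_minutes

# Resample both sources to hourly buckets so they can be joined on time.
cgm_hourly = cgm.set_index("timestamp").resample("1h").mean()
garmin_hourly = garmin.set_index("timestamp").resample("1h").sum()

# Build one daily summary table: average glucose, its variability, activity, sleep.
daily = cgm_hourly.join(garmin_hourly, how="inner").resample("1D").agg(
    {"glucose_mg_dl": ["mean", "std"], "steps": "sum", "sleep_minutes": "sum"}
)
daily.columns = ["glucose_mean", "glucose_std", "steps", "sleep_minutes"]

# Simple signal: how do activity and sleep track with glucose variability?
print(daily.corr()["glucose_std"])

# This summary CSV is what the custom GPT is asked to interpret.
daily.to_csv("daily_summary.csv")
```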
💡Explained
Today’s "Explained" paper on Memory Compression was prepared for you by Ema Ilić from Croatia, a curious and dedicated AI Engineer @ CHAPTR working with Generative AI, NLP, LLMs, and Prompting. Be sure to connect with her on LinkedIn!
Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference
Four days ago, a paper was published that really caught my attention.
📚Brief Explanation
Even though Transformers are the backbone of LLMs, inference remains inefficient due to the need to store in memory a cache of key-value representations for past tokens. The size of this cache increases linearly with the input sequence length and batch size. This paper introduces Dynamic Memory Compression (DMC): a method to reduce the length of the Key-Value cache in Transformers during inference (generation), which enhances the memory efficiency and speed of inference in LLMs.
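To make that linear growth concrete, here is a back-of-the-envelope estimate of KV cache memory. The model dimensions below are roughly those of a 7B-parameter, Llama-2-style architecture in fp16 and are assumptions for illustration only.

```python
def kv_cache_bytes(batch_size, seq_len, n_layers=32, n_kv_heads=32,
                   head_dim=128, bytes_per_value=2):
    """Bytes needed to cache keys and values for every past token (fp16)."""
    # The leading 2 accounts for storing both keys and values.
    return 2 * batch_size * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_value

# The cache grows linearly with both sequence length and batch size.
for seq_len in (1_024, 4_096, 16_384):
    gib = kv_cache_bytes(batch_size=8, seq_len=seq_len) / 2**30
    print(f"batch=8, seq_len={seq_len}: ~{gib:.0f} GiB")

# A 4x cache compression (the kind of ratio DMC targets) would cut
# each of these figures by roughly a factor of four.
```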
🏆 Methodology
At each step, DMC decides whether the new key (k) and value (v) pair should be appended to the cache or accumulated into the most recent entry, based on computed importance scores; this is what compresses the cache. The LLM's pre-training is then continued on a negligible amount of pre-training data while the compression rate is gradually increased towards a set target.
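Below is a minimal sketch of that append-or-accumulate idea for a single attention head. The 0.5 threshold and the importance-weighted running average are simplifications of my own; in the paper, the decision and importance variables are predicted by the model and learned during the continued pre-training.

```python
import torch

def dmc_step(cache_k, cache_v, cache_w, k_new, v_new, decision_score, importance):
    """One decision step of a DMC-style KV cache (single head, batch of 1).

    cache_k, cache_v: [cache_len, head_dim] past keys/values
    cache_w:          [cache_len] accumulated importance weights
    k_new, v_new:     [head_dim] key/value for the current token
    decision_score:   0-dim tensor in [0, 1]; low -> accumulate, high -> append
    importance:       0-dim tensor weighting the incoming token
    """
    if decision_score < 0.5 and cache_k.shape[0] > 0:
        # Accumulate: merge the new pair into the most recent cache slot
        # with an importance-weighted running average (no cache growth).
        w_old, w_new = cache_w[-1], importance
        cache_k[-1] = (w_old * cache_k[-1] + w_new * k_new) / (w_old + w_new)
        cache_v[-1] = (w_old * cache_v[-1] + w_new * v_new) / (w_old + w_new)
        cache_w[-1] = w_old + w_new
    else:
        # Append: grow the cache by one slot, as a vanilla Transformer would.
        cache_k = torch.cat([cache_k, k_new.unsqueeze(0)])
        cache_v = torch.cat([cache_v, v_new.unsqueeze(0)])
        cache_w = torch.cat([cache_w, importance.reshape(1)])
    return cache_k, cache_v, cache_w

# Example: start with an empty cache and feed a few tokens.
head_dim = 4
k_cache, v_cache, w_cache = torch.empty(0, head_dim), torch.empty(0, head_dim), torch.empty(0)
for _ in range(5):
    k_cache, v_cache, w_cache = dmc_step(
        k_cache, v_cache, w_cache,
        torch.randn(head_dim), torch.randn(head_dim),
        decision_score=torch.rand(()), importance=torch.rand(()),
    )
print("cache length after 5 tokens:", k_cache.shape[0])  # <= 5 thanks to accumulation
```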
🔑 Key Takeaways for Engineers:
This paper is relevant if you have the ability to pre-train your LLMs, particularly if you’re working with Llama 2 at 7B, 13B, or 70B scale.
DMC works by retrofitting pre-trained LLMs into DMC Transformers, achieving up to a ~3.7x throughput increase during auto-regressive inference on an NVIDIA H100 GPU. This means the LLM can now process up to 3.7x more tokens per second during generation!
In addition to speeding up generation, DMC also improves performance on several benchmarks (MMLU and CS-QA).
DMC requires pre-trained LLMs to be retrofitted on a negligible percentage of the original pre-training data (2% or 4%, for 2x and 4x compression), without adding extra parameters (relatively low implementation cost!).
DMC surpasses GQA (the current state-of-the-art solution to increase the memory efficiency of Transformers during inference). It can also be combined with GQA for compounded gains.
💸Key Takeaway for Tech Industry Stakeholders:
If you’re dissatisfied with the generation speed of your LLM, Dynamic Memory Compression might be the solution you’re looking for. It might even improve the performance of your LLM, which was demonstrated on two benchmarks.
👩🔬Personal Take:
I’m curious to see how this method would perform on other LLMs besides Llama 2 in terms of inference speed. I’m also interested in whether DMC really surpasses GQA when performed on some other LLMs. Overall, DMC appears to be a useful technique that I would recommend deploying if you are concerned with the generation speed of your Llama 2. If you have some extra resources, I would recommend experimenting with other LLMs as well. Let me know your results!
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Researchers from Apple have taken a big step towards building open multimodal models. In the paper, they share model and data design lessons and a general recipe for building MM1.
✂️Data ablation
An ablation study is a research technique where we remove or alter parts of a system and check how that affects its performance. In the paper, the authors use data ablation, which is based on a very similar idea: we systematically remove (or alter) parts of the training data to investigate the impact on the model's performance. This helps in understanding which elements are most critical for achieving high accuracy.
In the context of training multimodal large language models (MLLMs), data ablation involves conducting experiments where certain types of data (e.g., text-only data, image data, interleaved image-text data, or specific features within these data types) are either excluded from or specifically included in the training process. By comparing the performance of the model across these different training configurations, researchers can glean insights into how different types of data contribute to the model's ability to understand and generate responses based on the input it receives.
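As a hedged illustration (not the paper's actual training code), a data ablation harness can be as simple as training the same model on different data mixtures and comparing scores. `train_model` and `evaluate` below are placeholder stubs standing in for a real training and evaluation pipeline.

```python
import random

def train_model(data_sources):
    # Placeholder for the real (expensive) multimodal pre-training run.
    return {"data_sources": data_sources}

def evaluate(model, benchmarks):
    # Placeholder returning a dummy averaged score; swap in real evaluations.
    return random.random()

# Hypothetical ablation grid: which data types are included in training.
ablation_configs = {
    "full_mix":       {"interleaved": True,  "image_text_pairs": True,  "text_only": True},
    "no_text_only":   {"interleaved": True,  "image_text_pairs": True,  "text_only": False},
    "no_interleaved": {"interleaved": False, "image_text_pairs": True,  "text_only": True},
    "pairs_only":     {"interleaved": False, "image_text_pairs": True,  "text_only": False},
}

results = {}
for name, config in ablation_configs.items():
    model = train_model(data_sources=config)
    results[name] = evaluate(model, benchmarks=["captioning", "vqa", "text_only_qa"])

# Comparing configurations reveals which data types the model actually
# relies on (e.g. dropping text-only data tends to hurt language-only tasks).
for name, score in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```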
👨🍳How to select training data? Top lessons.
After performing data ablation studies, they formed the following tips:
Mixing different ratios of interleaved (text and image together) and captioned (image with a separate caption) data positively affects model performance. It's also worth including text-only data alongside image data; it was crucial for maintaining the model's language understanding capabilities. In general, use a variety of data types.
Example mix: 45% interleaved image-text documents 🥕, 45% image-text pair documents 🌽, and 10% text-only documents 🧅 (a small sampling sketch follows these tips).
Consider including high-quality synthetic data, such as synthetic captions (VeCap), to improve the model's learning.
Prioritize high-resolution images and use an image encoder capable of handling such resolutions to ensure the model can extract and learn from fine-grained details.
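To make the example mix concrete, here is a small sketch that samples training documents according to the 45/45/10 ratio above. Only the ratios come from the paper's example; the pool names and dummy documents are placeholders.

```python
import random

# Ratios from the example mix above; the document pools are placeholders.
MIXTURE = {
    "interleaved_image_text": 0.45,
    "image_text_pairs":       0.45,
    "text_only":              0.10,
}

def sample_training_batch(pools, batch_size=32, seed=None):
    """Draw a batch whose composition follows the mixture weights in expectation."""
    rng = random.Random(seed)
    names = list(MIXTURE)
    weights = [MIXTURE[n] for n in names]
    batch = []
    for _ in range(batch_size):
        source = rng.choices(names, weights=weights, k=1)[0]
        batch.append(rng.choice(pools[source]))
    return batch

# Tiny dummy pools just to show the call shape.
pools = {
    "interleaved_image_text": ["doc_interleaved_1", "doc_interleaved_2"],
    "image_text_pairs":       ["pair_1", "pair_2"],
    "text_only":              ["text_1", "text_2"],
}
print(sample_training_batch(pools, batch_size=8, seed=0))
```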
🏗️Model architecture and training
Because we are dealing with multimodality, we have to process different kinds of inputs with different tokenizers. In models like MM1, images are processed through an image encoder (e.g., a Vision Transformer, or ViT) to produce a set of visual tokens, while a text encoder converts the input text into textual tokens.
In the paper, they use a ViT-H Vision Transformer (at a resolution of 378x378 pixels) as the image encoder, pre-trained with a CLIP objective, which underscores the importance of image resolution for image encoding.
Then, we have a Vision-Language (VL) connector. The VL connector's role is to integrate the visual tokens from the image encoder with the textual tokens derived from the input text.
After the VL connector combines visual and textual tokens, a Multimodal Transformer-based language model processes the sequence, applying complex reasoning over both modalities.
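Here is a hedged, highly simplified sketch of that flow: a linear connector projects visual tokens into the language model's embedding space, and the result is concatenated with the text embeddings before the decoder runs. The module, shapes, and dimensions are illustrative assumptions, not MM1's actual implementation.

```python
import torch
import torch.nn as nn

class SimpleVLConnector(nn.Module):
    """Projects visual tokens into the language model's embedding space.

    A single linear projection is the simplest possible connector; real
    systems use more elaborate designs (pooling, cross-attention, etc.).
    """
    def __init__(self, vision_dim: int, text_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, text_dim)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: [batch, num_visual_tokens, vision_dim]
        return self.proj(visual_tokens)

# Illustrative shapes only, not MM1's actual dimensions.
batch, num_visual_tokens, vision_dim, text_dim = 2, 144, 1024, 4096

visual_tokens = torch.randn(batch, num_visual_tokens, vision_dim)  # from the image encoder (e.g. ViT)
text_embeddings = torch.randn(batch, 32, text_dim)                 # from the text tokenizer + embedding table

connector = SimpleVLConnector(vision_dim, text_dim)
projected = connector(visual_tokens)                               # [batch, 144, text_dim]

# Concatenate along the sequence dimension; the decoder-only LLM then
# attends over both modalities in a single token sequence.
multimodal_sequence = torch.cat([projected, text_embeddings], dim=1)  # [batch, 176, text_dim]
print(multimodal_sequence.shape)
```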
Conclusion
The findings are not surprising, but it's good that someone took the time to confirm them. They showed it's important to gather a balanced and diverse dataset that includes a mix of image-text pairs, interleaved documents, and text-only documents to support training across modalities. It's all about quality over quantity: when training multimodal models, we should use high-quality, high-resolution images and well-curated text data. Also, as its rising popularity shows, carefully incorporating synthetic data might be useful.
🗞️ Longreads
I recreated the most iconic photos of all time with AI in just one day.
AI image generators are trained on millions — even billions — of photos. It is safe to assume the vast majority of these photos are copyrighted and used without permission. And while there are billions of photographs, only a handful can be labeled as iconic. With that in mind, PetaPixel wanted to find out how easy or difficult it is to recreate celebrated photographs. (read)