Stanford published their AI Index Report 2024, and it's... long (500+ pages). I wrote a quick recap of the key insights on my LinkedIn, but I also want to leave the full report here with you to digest slowly. There are a lot of valuable numbers in there.
🗝️ Quick Bytes:
Meta’s battle with ChatGPT begins now
Meta's AI assistant, introduced last September and powered by the new Llama 3 model, now integrates with platforms like Instagram, Facebook, WhatsApp, and Messenger, and is accessible through Meta.ai. This expansion includes new features in Facebook's main feed and message inboxes, with enhanced capabilities like real-time image generation and integrated search results from Google and Bing. Initially available in multiple English-speaking countries, it is poised for broader global deployment.
Google is combining its Android and hardware teams — and it’s all about AI
Google has restructured internally to prioritize artificial intelligence, creating a "Platforms and Devices" team under Rick Osterloh that oversees Pixel, Android, and Chrome. This move aims to enhance hardware and software integration and accelerate AI adoption across Google's products, including AI-driven features in Pixel cameras and broader deployment of the Gemini model in its systems and devices. The reorganization also redefines roles, with Hiroshi Lockheimer focusing on new Google and Alphabet projects, and Sameer Samat becoming president of the Android ecosystem, ensuring strong partnerships and advancing first-party hardware development.
Microsoft's VASA-1 is a new AI model that turns photos into 'talking faces’
VASA is an advanced framework that creates realistic talking faces from a single image and speech audio, effectively syncing lip movements and capturing dynamic facial expressions and head movements. It uses a sophisticated model in a disentangled face latent space to generate high-quality, real-time videos at 512x512 resolution and 40 FPS. Although VASA offers promising applications in education and therapy, there are risks like impersonation. The developers prioritize responsible AI usage and plan to delay public release until its safe use is assured.
🎛️ Algorithm Command Line
How to extract insights from podcasts in minutes instead of hours?
I want to show you my method for distilling insights and key ideas from long podcasts.
This method is particularly beneficial for those who conduct regular research, generate social media content, seek to brainstorm ideas, or need concise insights due to time constraints.
1. Prepare your transcript as a text file.
2. Use this prompt:
You’re an experienced researcher. Your goal is to extract insights and direct quotes from this podcast transcript.
INSTRUCTIONS
Follow these steps carefully:
1. Write down the insights in a nested bulleted list with clear headers.
2. Each section should have at least 10 bullets. Be detailed and thorough.
3. Include direct quotes or passages to bring the insights to life.
3. Execute the prompt and review the output. Check out the example in my video below to see how it's done.
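If you'd rather run this programmatically than paste the transcript into a chat UI, here is a minimal sketch assuming the official OpenAI Python client and an API key in your environment. The model name and file path are placeholders, so swap in whatever long-context model you actually use.

```python
# Minimal sketch: run the extraction prompt over a transcript file.
# Assumes the official OpenAI Python client (`pip install openai`) and an
# OPENAI_API_KEY in the environment; model name and file path are placeholders.
from openai import OpenAI

PROMPT = """You're an experienced researcher. Your goal is to extract insights and direct quotes from this podcast transcript.

INSTRUCTIONS
Follow these steps carefully:
1. Write down the insights in a nested bulleted list with clear headers.
2. Each section should have at least 10 bullets. Be detailed and thorough.
3. Include direct quotes or passages to bring the insights to life."""

def extract_insights(transcript_path: str, model: str = "gpt-4-turbo") -> str:
    with open(transcript_path, encoding="utf-8") as f:
        transcript = f.read()

    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(extract_insights("podcast_transcript.txt"))
```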
Have you ever experimented with longer-context LLMs, such as gpt-4-1106-preview or gpt-4-0125-preview? If so, you probably know by now that in-context learning tends to underperform with models like these. Larger-context models assign more importance to the beginning and the end of the text, while information in the middle often gets lost when you later prompt about it.
Besides, a single inference run over a document with 100k tokens costs about 1.5 USD on Claude 3 Opus and 1 USD on GPT-4 Turbo.
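As a sanity check on those numbers, here's the back-of-the-envelope arithmetic. The per-million-token prices are assumed input-side list prices at the time of writing, and output tokens are ignored.

```python
# Back-of-the-envelope input cost for a 100k-token document.
# Prices are assumed list prices per million *input* tokens at the time of
# writing; output tokens are ignored.
PRICE_PER_M_INPUT = {"claude-3-opus": 15.00, "gpt-4-turbo": 10.00}

def input_cost(tokens: int, model: str) -> float:
    return tokens / 1_000_000 * PRICE_PER_M_INPUT[model]

print(input_cost(100_000, "claude-3-opus"))  # 1.5
print(input_cost(100_000, "gpt-4-turbo"))    # 1.0
```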
A paper that came out just a couple of days ago from researchers at UC Berkeley might give us a solution.
Summary
Let’s start with two questions:
How can we train a model to produce a compact representation (cheat-sheet) of the original context that the LLM can interpret and utilize effectively?
How to enable the LLM to proficiently navigate and extract relevant details from this representation during inference?
A good answer to the first question would be context compression.
This study focuses on the second question. To address it, the authors apply parameter-efficient finetuning (PEFT) directly on the compressed context (the cheat-sheet) without altering its content, which significantly improves the LLM's ability to accurately extract and utilize information from these compressed representations. This is called LLoCO, or Learning Long Contexts Offline.
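To make the PEFT ingredient concrete, here is a minimal sketch of attaching a LoRA adaptor to LLaMA2-7B with Hugging Face peft. The hyperparameters and target modules are illustrative assumptions rather than the paper's exact setup, and the LLoCO-specific step of finetuning on compressed summary embeddings instead of raw text is omitted here.

```python
# Sketch of the PEFT ingredient only: attach a LoRA adaptor to LLaMA2-7B.
# Hyperparameters and target modules are illustrative, not the paper's values;
# the LLoCO-specific step (finetuning on compressed summary embeddings rather
# than the raw context) is omitted. The base model checkpoint is gated on the Hub.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections only
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the LoRA weights are trainable
```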
Pipeline
The LLoCO pipeline consists of preprocessing, finetuning, and serving stages. Check out the paper for instructions on how to integrate this pipeline into your RAG system to enable retrieval-augmented document QA with compressed context.
Preprocessing stage: Building a vector DB, vectorizing chunks, and creating summary token embeddings.
Finetuning stage: Grouping documents according to type, and performing PEFT using a LoRA adaptor.
Serving stage: Instead of retrieving the actual passages as in RAG, the retriever fetches the compressed token embeddings of the relevant passages, which are concatenated and prepended to the decoder LLM's input. The corresponding LoRA adaptor is then looked up and applied to the decoder LLM.
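Here is how I picture the serving stage in code. This is a schematic sketch, not the released implementation: retriever, adaptor_registry, and the summary_embeddings / doc_group fields are hypothetical stand-ins, and the decoder is assumed to be a PEFT-wrapped LLaMA2-7B.

```python
# Schematic serving flow as described above; `retriever`, `adaptor_registry`,
# and the hit fields are hypothetical stand-ins, not the released LLoCO code.
import torch

def answer(question: str, retriever, adaptor_registry, decoder, tokenizer) -> str:
    # 1. Retrieve the *compressed* summary-token embeddings instead of raw passages.
    hits = retriever.search(question, top_k=4)
    summary_embeds = torch.cat([h.summary_embeddings for h in hits], dim=1)  # (1, k*m, d)

    # 2. Look up and activate the LoRA adaptor finetuned for this document group.
    decoder.set_adapter(adaptor_registry.lookup(hits[0].doc_group))

    # 3. Prepend the summary embeddings to the embedded question and generate.
    q_ids = tokenizer(question, return_tensors="pt").input_ids
    q_embeds = decoder.get_input_embeddings()(q_ids)
    inputs_embeds = torch.cat([summary_embeds, q_embeds], dim=1)
    out = decoder.generate(inputs_embeds=inputs_embeds, max_new_tokens=256)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```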
Architecture
The architecture consists of a context encoder (which compresses the original long context into a more compact representation) and an LLM decoder (LLaMA2-7B with a 4k context window).

The context encoder can be any model capable of producing a compact representation aligned with the LLM decoder. Think of the summary embeddings as pseudo-words in the LLM decoder's text embedding space, representing abstract concepts or summaries (in this paper, the authors used AutoCompressor for LLaMA2-7B).
Key Takeaways for Engineers
LLoCO: a novel pipeline that combines context compression, retrieval, and parameter-efficient finetuning. It can be deployed to significantly speed up and reduce the cost of long-document question answering.
QA datasets used to assess LLoCO: QuALITY, Qasper, NarrativeQA, HotpotQA; summarization dataset: QMSum.
LLoCO outperforms the baseline on all datasets by a substantial margin while using 30 times fewer tokens.
Its performance is the most impressive on NarrativeQA (average document length ~85K tokens)

LLoCO outperforms LLaMA2-7B-32K on the Needle in a Haystack task: choose a haystack (a long article, 32K+ tokens) and hide a needle in it (a key piece of "hidden" information that has to be retrieved). A toy construction of such a test is sketched after this list.

LLoCO achieves speed-ups of up to 7.62× on an A100 and 7.19× on an A6000 GPU compared to the LLaMA2-7B baseline without compression, under identical context conditions.
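For reference, here's a toy illustration of how a Needle in a Haystack test can be constructed. This is not the paper's exact evaluation harness: the needle text is made up, and query_model is a hypothetical stand-in for whichever QA pipeline (LLoCO or a baseline) you are testing.

```python
# Toy needle-in-a-haystack construction; `query_model` is a hypothetical
# stand-in for the QA pipeline under test, and the needle text is made up.
NEEDLE = "The secret passcode mentioned in this article is 7481."
QUESTION = "What is the secret passcode mentioned in the article?"

def build_haystack(filler_paragraphs: list[str], depth: float = 0.5) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end) of the filler text."""
    position = int(len(filler_paragraphs) * depth)
    paragraphs = filler_paragraphs[:position] + [NEEDLE] + filler_paragraphs[position:]
    return "\n\n".join(paragraphs)

# haystack = build_haystack(long_article_paragraphs, depth=0.5)
# answer = query_model(context=haystack, question=QUESTION)
# passed = "7481" in answer
```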

Key Takeaways for Tech Industry Stakeholders
If you are struggling to obtain accurate responses from your chatbot or a similar LLM-based QA system when querying longer passages of text, consider sharing my article on LLoCO with your team.
This paper will be relevant if you are working with the LLaMA2-7B model, as there is an easily accessible AutoCompressor finetuned on it, and the combination of the two has been tested in this paper.
I recommend experimenting with this if your data is static and changes infrequently.
Personal take
Rather than implementing the entire LLoCO pipeline, consider just compressing the context when it exceeds a certain threshold. If you establish that this improves your system's results, consider implementing the entire LLoCO pipeline as described in the paper.
I would love to see other compression methods tested beyond AutoCompressor for LLaMA2-7B. That would, of course, imply testing with other LLMs, which brings me to my next point. The paper doesn't explain why this wasn't done (to the best of my understanding), but I assume it is because few LLMs come with both 1) the possibility of finetuning and 2) a compatible context encoder.
I would love to see more context encoders being finetuned for different LLMs, and I believe that's the direction the industry will keep moving in, given the success of context encoders on smaller LLMs like LLaMA2-7B.
I believe this method works best with static data that doesn’t frequently change, as performing PEFT frequently might get quite expensive, so bear that in mind when deciding whether LLoCO is worth implementing for your use case.
Original code is available here.
Original paper is this one.