Google is trying to attract as much attention to Gemini as possible.
As of now, you can try Gemini Advanced for free for at least 2 months, as part of the Google One AI Premium Plan (an even longer name would be better, I guess).
All you need to do is click here and start using it.
Google promises that in the near future every participant and subscriber will also get access to Gemini's integration with Docs, Gmail, and some more features.
Let's see how it goes.
🗝️ Quick Bytes:
Google’s AI now goes by a new name: Gemini
Google has rebranded its AI products under one name: Gemini. This includes the chatbot formerly known as Bard, as well as the AI features in Google Workspace apps previously called Duet. Gemini will power more Google products going forward as the company bets on AI.
In tests, Gemini has performed nearly on par with competitors like GPT-4. However, it was significantly slower. Now Google aims to prove Gemini can keep pace as it tries to convince developers to build on its AI platform rather than rivals.
The rebranding raises the stakes for Google to compete in AI against OpenAI, Anthropic, Perplexity and others. With the company appearing fully committed to Gemini, its success or failure will significantly impact Google as a whole.
OpenAI is adding new watermarks to DALL-E 3
OpenAI is adding new watermarks to images generated by its DALL-E 3 AI system. The watermarks will include both invisible metadata and a visible symbol in the corner. This allows people to verify if an image was made by AI and identify which system created it. Major technology companies formed the Coalition for Content Provenance and Authenticity (C2PA) to develop these AI content credentials. The goal is to increase trust in digital content, but watermarks can still be removed, limiting their effectiveness.
The watermarks being added to DALL-E 3 images contain both invisible metadata and a visible "CR" symbol. The metadata can be read by services like Content Credentials Verify to show that the image was made by AI, specifically DALL-E 3 in this case. The visible symbol also shows the image was AI-generated and aims to discourage passing it off as created by a human.
Companies like Adobe, Microsoft and Meta are members of the C2PA coalition, which developed the content credentials standard OpenAI is now adopting. The Biden administration has also urged use of visible labels on AI content to combat misinformation. However, OpenAI admits the watermarks can easily be removed, especially once images are uploaded to social media sites. So while an important step, watermarks alone cannot fully solve issues around identifying AI-generated content.
The uncomfortable truth about AI’s impact on the workforce is playing out inside the big AI companies themselves
Google, Microsoft and other major tech companies are massively increasing investments in AI, even as they slow hiring or cut jobs in non-AI areas. Alphabet plans to ramp up capital spending significantly in 2024, focused overwhelmingly on servers and data centers to support AI applications, while at the same time recently laying off around 1,000 employees. Microsoft also continues to cut jobs in gaming and other units while focusing hiring on AI talent, planning a material increase in capital expenditures on cloud and AI infrastructure.
As Google, Microsoft and others chase leadership in AI, they are streamlining operations and realigning spending, with safe jobs being those that directly support AI ambitions. Both companies have referenced using AI itself to drive internal efficiencies and cost savings, likely indicating automation of some roles. Alphabet CFO Ruth Porat specifically called out "streamlining operations across Alphabet through the use of AI" as helping curb hiring plans.
The laser focus on AI investments even amid job cuts indicates that core AI talent will be most valued across the tech landscape going forward. Both Google and Microsoft are focused on top technical AI experts to drive new products and services, while cutting non-essential areas. This likely foreshadows an emphasis on AI skills being key for tech industry job security.
🎛️ Algorithm Command Line
I found a very interesting prompt in one of Perplexity’s channels on Discord. It’s a bit complicated at first but if used properly - it works like a charm.
I noticed that some of the answers I got, not only from GPT but also from Gemini and Claude, became significantly better and more relevant.
It’s a structured reasoning framework and a methodical approach to problem-solving. It involves five key steps designed to improve the output you get from LLMs.
Try it out!
Adhere to the following structured reasoning framework to craft your response:
1. Decomposition of the Problem: Dissect the complex problem into individualized elements termed as "thought". These function as independent sub-problems, akin to puzzle pieces in a grander scheme, each contributing to the resolution of the main problem.
2. Ideation of Potential Thoughts: For each "thought", brainstorm various solutions (i.e., intuitive or heuristic judgments based on the thought, or logical and analytical reasoning based on the thought). Consider this stage as planting seeds of ideas, nurturing each with careful consideration.
3. Assessment of Thoughts: Conduct a meticulous appraisal of the incremental progress achieved by each "thought" in problem-solving. This involves carefully considering the merits and demerits of each solution, as if balancing scales of judgment.
4. Optimal Reasoning Path Selection: Undertake a rigorous review of all potential routes, contrasting and examining their efficacy and alignment with the root problem. After this evaluation, choose the most suitable course of action, like a navigator charting the most promising course.
5. Generate the Final Response: Consider the insights from the previous steps, provide detailed explanations enriching the best path, then assemble a clear answer.
Finally, provide a confidence score between 0.0 and 10.0 based on logic, evidence support, accuracy, relevance, innovativeness & actionability, and explain the reason for each score.
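If you'd rather use this framework programmatically than paste it by hand, here is a minimal sketch that sends it as a system prompt. It assumes the official OpenAI Python client; the model name and the example question are placeholders, and the framework text is abbreviated to the steps above.

```python
# Minimal sketch: wrap the structured reasoning framework as a system prompt.
# Assumes the official OpenAI Python client (pip install openai); model name
# and example question are placeholders.
from openai import OpenAI

REASONING_FRAMEWORK = """Adhere to the following structured reasoning framework to craft your response:
1. Decomposition of the Problem: ...
2. Ideation of Potential Thoughts: ...
3. Assessment of Thoughts: ...
4. Optimal Reasoning Path Selection: ...
5. Generate the Final Response: ...
Finally, provide a confidence score between 0.0 and 10.0 and explain each score."""

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; any capable chat model works
    messages=[
        {"role": "system", "content": REASONING_FRAMEWORK},
        {"role": "user", "content": "How should a small team roll out an internal documentation assistant?"},
    ],
)
print(response.choices[0].message.content)
```

The same framework can just as well be pasted in front of a question in Gemini or Claude; the system-prompt wrapper is only a convenience.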
💡Explained
Efficient Exploration for LLMs
Recently, a new and exciting paper from Google DeepMind and Stanford University on efficient exploration in gathering human feedback to improve LLMs was published. The authors raise important questions, including one that I have also asked myself a few times: will gathering more and more data help us get better models? And what happens once we have used all of the data? Given that we currently use training processes like RLHF to learn only from humans, how can we hope for superhuman performance? We are modeling LLMs to be like humans, so we are creating imperfect humans.
Imperfect human but with broad horizons?
Do you remember the unconfirmed OpenAI leak about the Q* algorithm that could bring us one step closer to AGI? At that time, some theorists suggested that LLMs might not be perfect, but they can still be used to generate thousands of ideas, one of which might be groundbreaking. The authors had a similar idea. Imagine a pretrained model that extrapolates from its training data to generate large numbers – perhaps millions or billions – of ideas and concepts. If even one of them is groundbreaking, we can continue building on top of it. This way, with enough human feedback, a model is taught to become capable of generating content that a human could not. But how much time would data collection take? Months, years, or decades?
🗺️ Exploration and Human Feedback
Classic RLHF is a method of model training where queries, each consisting of a prompt and generated responses, are sent to human annotators and rated from best to worst according to human preference. Then, using that data, we train a Reward Model and use it during the final alignment of the model to human preference. The authors call this approach passive exploration.
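As a rough illustration of this "passive" setup, here is a sketch of the standard pairwise (Bradley-Terry style) reward-model loss commonly used in RLHF, assuming you already have scalar reward scores for the preferred and rejected responses. This is a generic illustration, not the paper's exact formulation.

```python
# Sketch of the standard pairwise reward-model loss used in RLHF.
# Assumes reward_chosen / reward_rejected are scalar reward-model outputs for
# the human-preferred and the rejected response to the same prompt.
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Maximize the probability that the chosen response scores higher:
    # loss = -log(sigmoid(r_chosen - r_rejected))
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy example with made-up reward scores for two preference pairs
loss = reward_model_loss(torch.tensor([1.3, 0.2]), torch.tensor([0.4, 0.9]))
print(loss.item())
```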
🏃🏻 Active exploration
Active exploration is the strategic selection of queries based on past human feedback to improve the training process of LLMs. The method contrasts with passive exploration, where queries are chosen without leveraging past feedback. During training, the agent actively selects response pairs to maximize the quality of future feedback, using information gained from past interactions. In simpler words, the idea is to use what the model has learned from earlier feedback to pick the next set of questions that are most likely to give new and useful information.
So how are the response pairs selected? The authors experimented with two different approaches to exploration.
📘Boltzmann Exploration
The idea behind using Boltzmann exploration is to make „educated guesses” (assigning a probability to each response based on its estimated reward) on which questions to ask next. This is based on which questions got helpful answers before. Responses with higher estimated rewards are more likely to be chosen, but there's still a chance of selecting less favorable responses to ensure exploration.
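A minimal sketch of what Boltzmann (softmax) selection over estimated rewards could look like; the reward values and temperature below are made up for illustration.

```python
# Sketch of Boltzmann exploration: sample responses with probability
# proportional to exp(reward / temperature), so high-reward responses are
# favoured but lower-reward ones still get picked occasionally.
import numpy as np

rng = np.random.default_rng(0)

def boltzmann_pick(estimated_rewards: np.ndarray, temperature: float = 1.0, k: int = 2) -> np.ndarray:
    logits = estimated_rewards / temperature
    probs = np.exp(logits - logits.max())          # numerically stable softmax
    probs /= probs.sum()
    # Pick k distinct responses to show to the annotator
    return rng.choice(len(estimated_rewards), size=k, replace=False, p=probs)

rewards = np.array([0.1, 0.7, 0.4, 0.9])           # made-up reward estimates for 4 candidate responses
print(boltzmann_pick(rewards, temperature=0.5))    # indices of the two responses sent for feedback
```

A lower temperature makes the choice more greedy; a higher one makes it closer to uniform random exploration.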
🤔Epistemic Neural Networks
ENNs model the uncertainty about the rewards associated with different responses. This means that an ENN doesn't just assign a single, fixed reward to each response based on how well it matches human preferences; instead, it considers a range of possible rewards, each corresponding to a different "what-if" scenario represented by an epistemic index. This approach accounts for the fact that confidence in the reward might vary: in some cases, annotators might be very sure which response is better, while in others they might not be certain. By modeling uncertainty, ENNs make more informed decisions about which responses are likely to be preferred, which helps the model learn more effectively from human feedback.
Infomax - takes an ENN reward model as input and generates N responses. For each pair of responses, the ENN predicts how likely it is that one will be preferred, using different scenarios (epistemic indices). Infomax then measures how much these predictions vary across scenarios and picks the pair of responses with the greatest variation (to learn the most from the feedback it gets).
Double TS - picks queries that help find the best responses. It tries to choose two responses that might be the best by first generating a set of responses, then picking two from this set based on their potential rewards, ensuring that they are different. If it can't find two distinct responses after several tries, it randomly picks the second one.
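Here is a rough numpy sketch of both selection schemes over made-up rewards. Deriving pairwise preference probabilities from reward differences with a logistic function is an assumption of this sketch, not necessarily the paper's exact formulation.

```python
# Sketch of the two exploration schemes, assuming an ENN reward model that
# gives rewards[z, i] = reward of response i under epistemic index z.
import numpy as np

rng = np.random.default_rng(0)

def infomax_pair(rewards: np.ndarray) -> tuple[int, int]:
    # Predicted preference probabilities for every (i, j) pair under each index z
    pref = 1.0 / (1.0 + np.exp(-(rewards[:, :, None] - rewards[:, None, :])))
    variance = pref.var(axis=0)               # disagreement across epistemic indices
    np.fill_diagonal(variance, -np.inf)       # never pair a response with itself
    i, j = np.unravel_index(np.argmax(variance), variance.shape)
    return int(i), int(j)

def double_ts_pair(rewards: np.ndarray, max_retries: int = 10) -> tuple[int, int]:
    num_indices, num_responses = rewards.shape
    first = int(np.argmax(rewards[rng.integers(num_indices)]))   # best response under one sampled scenario
    for _ in range(max_retries):
        second = int(np.argmax(rewards[rng.integers(num_indices)]))
        if second != first:
            return first, second
    # Couldn't find a distinct second response, fall back to a random different one
    return first, int(rng.choice([i for i in range(num_responses) if i != first]))

rewards = rng.normal(size=(8, 4))             # made-up rewards: 8 epistemic indices, 4 candidate responses
print(infomax_pair(rewards), double_ts_pair(rewards))
```

Infomax picks the pair the model disagrees with itself about the most, while Double TS picks two responses that each look best under an independently sampled "what-if" scenario.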
Results
The best win rate was achieved by Double TS, followed by infomax, Boltzmann, and passive exploration in last place. The authors showed that, after 40,000 queries, Double TS predicts human preference for the first response more and more accurately over time. Unlike Boltzmann exploration, which lacks guidance from uncertainty estimates and fails to adjust its predictions effectively, Double TS uses uncertainty to make better predictions, showing a great ability to adapt and learn from feedback. Most importantly, asking the right questions based on what the model already knows makes training much faster and might lead to breakthroughs, in theory surpassing human creativity.
The new „Explained” section was prepared for you by Dominykas Stankevičius (Chaptr), a great Senior Software Researcher specializing in AI, Agents, LLMs, MLOps, and Prompt Engineering.
Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine
Researchers at Microsoft have shown some impressive results on how far you can take a base GPT-4 model without the need for fine-tuning, by applying some "prompt innovation". Using a combination of existing and newly introduced prompt engineering techniques, their approach significantly outperformed a heavily fine-tuned and prompt-engineered PaLM-2 model from Google on the MultiMedQA collection of medical benchmarks based on multiple choice questions. They name their approach "Medprompt", and contrary to what the name suggests, they show that it is a general prompting strategy applicable to a wide range of tasks. On the other hand, Google's prompts were specifically adapted for the dataset. This seems like a decent argument for the question of prompt engineering vs. fine-tuning that many companies are constantly considering.
⚙️No more need for "think step-by-step"?
One of the authors' novel contributions was to allow GPT-4 to come up with its own reasoning traces. Instead of fine-tuning the model on the available training data, they instead generated a chain-of-thought rationale along with the predicted answer to each medical question, and only retained those reasoning traces which ultimately led to the correct answer, while discarding the incorrect ones. For evaluation, these training examples with self-generated reasoning traces were used as "few-shot" examples, also known as "in-context learning". The authors argue that these GPT-4 generated rationales were longer and more detailed than the ones crafted by clinical experts, used with the PaLM-2 model. This is a great example of what's possible with labelled and well-structured text datasets before considering spending a lot of resources to fine-tune your own LLM on them.
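A simplified sketch of that self-generated chain-of-thought step, assuming the OpenAI Python client and a list of training questions with known correct answers; the model name and prompt wording are placeholders, not the paper's exact ones.

```python
# Sketch of Medprompt-style self-generated chain of thought: have the model
# produce a rationale plus an answer for each training question, and keep the
# rationale only if its final answer matches the known correct one.
from openai import OpenAI

client = OpenAI()

def generate_cot_examples(training_items, model="gpt-4o"):
    """training_items: list of dicts with 'question', 'choices' (text), 'correct' (e.g. 'B')."""
    kept = []
    for item in training_items:
        prompt = (
            f"{item['question']}\n{item['choices']}\n"
            "Think step by step, then finish with a line 'Answer: <letter>'."
        )
        reply = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}]
        ).choices[0].message.content
        predicted = reply.rsplit("Answer:", 1)[-1].strip()[:1]
        if predicted == item["correct"]:      # keep only rationales that reached the right answer
            kept.append({"question": item["question"], "rationale": reply})
    return kept
```

The retained question-rationale pairs then serve as the pool of few-shot examples at inference time.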
👨🔬Still have to take your eyes off the test set?
Another interesting idea the authors experimented with was using an "eyes-off" test set for prompt engineering, a common practice in machine learning to prevent overfitting a model on your training and validation data. They argue that an LLM's prompt should be considered as a hyperparameter which we are adjusting based on our data, therefore it is still possible to overfit it even if we're not adjusting the actual parameters of the model. The experiments showed that using an eyes-off test set even caused the model to perform better in some cases, although by a very small margin. It's important to note that this likely applies not just to the prompt itself, but also to the overall prompting strategy used.
🤖 Training GPT-4 without actually training it
Another gain in performance came from what the authors refer to as "dynamic few-shot learning". Here, instead of packing a fixed set of data examples from the training set into the model's context, 5 training samples are selected at inference time as the five nearest neighbours to the test sample in the embedding space of all training examples. The authors claim that, in a way, this acts like training the model on the whole dataset without actually modifying its weights, although it produced only a 0.8% gain over selecting the 5 samples randomly.
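A sketch of that dynamic few-shot selection, assuming an embedding model from the OpenAI API; the embedding model name and the example questions are placeholders.

```python
# Sketch of dynamic few-shot selection: embed every training question once,
# then at inference time pick the 5 nearest neighbours of the test question
# (cosine similarity) and use them as in-context examples.
import numpy as np
from openai import OpenAI

client = OpenAI()
EMBED_MODEL = "text-embedding-3-small"  # placeholder embedding model

def embed(texts):
    data = client.embeddings.create(model=EMBED_MODEL, input=texts).data
    vecs = np.array([d.embedding for d in data])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize

train_questions = ["...training question 1...", "...training question 2..."]  # your labelled training set
train_vecs = embed(train_questions)

def select_few_shot(test_question, k=5):
    q = embed([test_question])[0]
    sims = train_vecs @ q                   # cosine similarity (vectors are normalized)
    top = np.argsort(-sims)[:k]             # indices of the k most similar training questions
    return [train_questions[i] for i in top]
```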
🧩Putting it all together
Combining all the above techniques and finishing it all off with a sampled (non-greedy) decoding by shuffling the possible answer choices 5 times and choosing the most popular one, also referred to as self-consistency prompting, led to a 90.2% final performance on the MedQA dataset, almost a 4% gain over the heavily fine-tuned PaLM-2. Another thing worth noting is that the authors only used 5 inference calls to the LLM for each test case, compared to a whopping 44 model calls used by the Google team. Finally, Medprompt is shown to generalize over other medical challenges, as well as similarly structured datasets across law, psychology, engineering, philosophy, and others. The authors end by encouraging the use of Medprompt for non-multiple choice settings as well, although that might require some extra "prompt innovation".
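The final ensembling step could look roughly like this: ask the same question several times with the answer options shuffled, map each reply back to the original option, and take the majority vote. Again this assumes the OpenAI client, and the prompt wording, model name, and run count are placeholders rather than the paper's exact setup.

```python
# Sketch of choice-shuffling self-consistency: run the question 5 times with
# shuffled answer options, map each predicted letter back to the original
# option text, and return the most common answer.
import random
from collections import Counter
from openai import OpenAI

client = OpenAI()

def answer_with_self_consistency(question, options, model="gpt-4o", runs=5):
    votes = []
    for _ in range(runs):
        shuffled = random.sample(options, len(options))
        letters = "ABCDE"[: len(shuffled)]           # assumes at most 5 options
        listing = "\n".join(f"{l}. {o}" for l, o in zip(letters, shuffled))
        prompt = f"{question}\n{listing}\nReply with the single letter of the best answer."
        reply = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}]
        ).choices[0].message.content.strip()[:1]
        if reply in letters:
            votes.append(shuffled[letters.index(reply)])  # map letter back to the original option
    return Counter(votes).most_common(1)[0][0]
```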
🗞️ Longreads
Inside OpenAI’s Plan to Make AI More ‘Democratic’. (read)