Gemini and AlphaCode 2

Original paper · Gemini Team, Google et al 2023

It's finally here. After months of rumors, question marks and speculation, it looks like Google actually pulled through. The sentiment was pretty split on if they were going to succeed, especially considering the rumors just last week of the launch being pushed back to next year. Question is though if it's still a bit too late, in the research community I feel like the general hype for LLM's has reduced significantly, there's a looming sentiment that we've exhausted the scaling laws and are only looking at minimal returns on future computational efforts. Anyway, Google is reporting that it's managed to beat GPT-4 across most benchmarks so let's focus on the present and see what there is to gather from their technical report, even though I worry that it's going to be very little...

Recycling old architectures?

Gemini is, to no-one's suprise, a decoder-only Transformer model, enhanced with numerous optimizations to enable stable training and optimized inference on Google's Tensor Processing Unit. The TPU sees very little love outside of Google's internal work, primarily because Google only sells them to select Cloud partners. That mean's they can't utilize the work horse that is open-source communities and instead have to rely on a lot of work internally. While this is costly, it comes with the advantage of having complete control of the entire stack.

The model comes in three flavors: Ultra, Pro, and Nano, each designed for specific purposes. They support an extensive 32k context length and employ multi-query attention mechanisms for efficient processing. The multimodal capabilities of Gemini are particularly striking; the models are trained, beginning to end, across multiple modalities, accommodating audio, video and image input. Videos are encoded as a sequence of image tokens. Remarkably, the Pro model achieves its training goals in mere weeks, a testament to its efficiency. Meanwhile, the Nano series (two models at 1.8B and 3.25B) is distilled to produce best-in-class small language model performance, they are 4-bit (!!) quantized to run on-device. Unfortunately, this is they tell us about the architecture.

Training

They go on discussing the immense complexities of training models of this scale. Ultra is trained on a large fleet of TPUv4 accelerators across multiple SuperPods (4096 chips). We would expect the total number of chips to be somewhere in the 10s of thousands. The intra-cluster and inter-cluster networks are fast enough to warrant commonly used synchronous training techniques such as model and data parallelism. They implement everything using JAX and are able to improve overall goodput (time spent computing over the total elapsed training time) from 85% (PaLM) to 97%. I've quite enjoyed using JAX lately, it's become a favorite in a couple of reinforcement learning projects thanks to the ability to vectorize environments and run an entire GPU-only RL setup.

Gemini's dataset is both multimodal and multilingual, as one would expect given the extent of Google's name. The recent rumor stating Gemini's delay, claimed that the model wasn't handling multilingual very well but it seems like this has been fixed then. The data used comes from web documents, books and code, including images, audio and video data. I speculate they have been able to extract a lot of high quality textual data from audio and video but we can't know for certain because they don't tell us the amount of tokens used. SentencePiece is used as the tokenizer,however SP isn't a tokenizer in itself but rather a tokenization framework so it's unclear exactly what algorithm is used. Despite not telling us how many tokens the models are trained on, they do state that they follow the approaches of Chinchilla and Llama, with smaller models training for significantly more tokens to improve performance for a given inference budget.

Evaluation

I don't want to get too bogged down with the details in the evaluation but certain things stood out to me.

Gemini Ultra is the first to achieve superhuman performance on MMLU, an exam benchmark spanning 57 subjects. This feat is the result of its innovative self-consistency with chain-of-thought (CoT-SC) approach, which employs an ensemble of independently sampled chains of thought to derive answers. Google terms their specific method, uncertainty routed CoT, relying on a threshold consensus between chains to determine an answer before backtracking to greedy sampling using maximum likelihood. Interestingly, while Gemini Ultra's CoT@32 configuration even surpasses GPT-4’s reported performance in MMLU, it falls behind in few-shot learning scenarios. This highlights the nuanced differences in how these models handle limited data or context.

A critical aspect of Gemini's evaluation is its synthetic retrieval tests, aimed at assessing the model's long context capabilities. These capabilities are essential for applications that deal with extensive datasets, such as legal or historical document analysis. Gemini reportedly achieves a 98% accuracy in retrieving correct values across its full context length. However, the experimental design – placing key-value pairs at the start of a context followed by filler text – raises questions about the model's ability to process and recall information uniformly across a large, unstructured dataset.

Beyond textual understanding, Gemini’s performance in multimodal tasks is a significant leap forward. Its state-of-the-art results in processing image, video, and audio data showcase an advanced level of understanding and processing across different data types, a capability that extends well beyond what text-only models like GPT-4 offer. While there's ongoing discussion about Gemini's textual capabilities compared to GPT-4, its proficiency in multimodal tasks is a clear indicator of its advanced and versatile AI technology.

Instruction tuning

I was hoping to get a lot more out of the instruction tuning details behind Gemini, but alas we have to settle for three paragraphs. Gemini's development underscores the importance of data quality over quantity. This principle is vital for creating models that are not only high-performing but also nuanced in understanding and response. The model's development involved a balancing act between helpfulness and harmlessness, achieved through multi-objective optimization. This approach is crucial for ensuring that model outputs are accurate, relevant, safe, and ethically sound. Additionally, the use of a methodology similar to Constitutional AI, where variants of Google's content policy are embedded into the training regime, suggests a move towards more responsible and regulated AI development. By aligning the model with specific ethical guidelines and standards, Gemini represents a significant step towards AI that is not only technologically advanced but also aligned with broader societal values and norms.

AlphaCode 2

Despite a lackluster amount of details on Gemini, there was another nugget that dropped today - AlphaCode 2 - and this time we actually do get something to sink our teeth into! This new iteration builds upon the foundation laid by its predecessor, AlphaCode, which was the first AI system to show competitiveness in this arena. AlphaCode 2 advances this legacy by solving 1.7x more problems than AlphaCode, achieving a remarkable standing at the 85th percentile of competitors on Codeforces.

The system's efficiency and capability are rooted in its integration of Large Language Models, enhanced by advanced search and re-ranking algorithms. The workflow of AlphaCode 2 is outlined in the figure below, providing a clear visualization of its sophisticated process:

The algorithm behind AlphaCode 2 is methodically structured:

Model Generation: A host of AlphaCode 2 models are created through fine-tuning processes on Gemini Pro. This phase leverages expert demonstrations from a range of high-quality sources, integrating a scoring model and employing offline reinforcement learning for enhanced performance.
Sampling Process: In this stage, the AlphaCode 2 models generate up to a million code samples for each problem, utilizing a high temperature parameter to encourage a diverse range of solutions. The goal is to exhaust the model’s diverse distribution, maximizing the probability of containing the correct solution within these samples. This approach leaves a substantial task for the filtering and re-ranking algorithms to identify the best solutions.
Aggressive Filtering: Each competitive programming task usually includes at least one public input/output test. AlphaCode 2 utilizes this feature to aggressively filter the generated code samples, where on average, about 95% of the samples are discarded during this phase.
Clustering and Trimming: Post-filtering, there are typically around 50,000 candidate solutions per problem. To manage this volume, samples are clustered based on their runtime behavior. The clusters are then ordered by size, and only the 10 largest clusters are retained. This strategy is aimed at reducing redundancy and focusing on the most promising solution sets.
Scoring and Selection: The final step involves using the scoring model to select the optimal solution from each of the top 10 clusters for submission.

While AlphaCode 2's approach has shown to yield the best results seen to date in competitive programming, it is not without its challenges. The system requires extensive trial and error and relies heavily on filtering out ineffective samples. This process, though effective, is resource-intensive and potentially costly to operate at scale. However, despite these limitations, AlphaCode 2 represents a significant step forward, indicating that while it may not yet be a fully rounded and universally applicable solution, it is certainly moving in the right direction.