what we've learnt in 2023. and what we haven't

December 31, 2023.

As we reflect on the advancements in the field of AI in 2023, it's clear that this year has been a watershed moment. We've seen significant strides in accessibility, simplicity of development, and the democratization of AI technology. Yet there's still a lot of fog surrounding these technologies, and plenty of unanswered questions that continue to shape the research landscape.

What we’ve learnt in 2023

LLMs are now accessible. The past year has witnessed the transformation of large language models from mere academic curiosities to pivotal elements in our daily lives. Moving away from the earlier perception of a vague, almost dystopian technology, AI now holds a place of practical significance and emotional resonance in many people's lives. It's become increasingly evident that proficiency in these AI tools can profoundly enhance our work efficiency and overall quality of life. This sentiment prevails despite the pockets of skepticism and caution within the academic community [1] [2] [3].

Building LLMs is surprisingly simple. Contrary to popular belief, the initial stages of developing large language models are less complicated than anticipated. It turns out they're quite easy to set up and run yourself! The real intricacy emerges when scaling up the models and their corresponding datasets, a task that necessitates vast amounts of training data, and reinforces the fact that

Data reigns supreme. Data quality and curation have repeatedly emerged as the primary pillars of successful base models. Sophisticated data pipelines are now enabling smaller models to rival their larger and more expensive counterparts. A year ago, capable models cost millions of dollars to train. Today, we're seeing promising work for less than $100k [1]. Training LLMs still isn't a poor man's hobby, but the playing field has democratized.
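To make the data point concrete, here's a minimal sketch of the kind of filtering and deduplication pass that sits at the heart of most curation pipelines. The thresholds and the exact-hash dedup are illustrative assumptions on my part; real pipelines use fuzzy dedup (e.g. MinHash) and far richer quality heuristics.

```python
import hashlib

def clean_corpus(docs, min_chars=200, max_symbol_ratio=0.1):
    """Toy curation pass: drop short or symbol-heavy documents, then
    exact-deduplicate by content hash. Thresholds are illustrative."""
    seen = set()
    kept = []
    for text in docs:
        text = text.strip()
        if len(text) < min_chars:
            continue  # too short to carry useful signal
        symbols = sum(1 for c in text if not (c.isalnum() or c.isspace()))
        if symbols / len(text) > max_symbol_ratio:
            continue  # likely boilerplate, markup or scraping debris
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate
        seen.add(digest)
        kept.append(text)
    return kept
```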

Local LLMs. Spearheaded by Meta's release of Llama and followed up by Mistral 7B and Mixtral, open-source LLMs have been on everyone's mind during 2023. These developments have democratized access to powerful LLMs, making it possible for anyone to run them on various devices, from old laptops to modern smartphones.
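For illustration, this is roughly what "running an LLM yourself" looks like with the Hugging Face transformers library. The Mistral 7B checkpoint and generation settings are just one plausible choice, and heavily quantized runtimes like llama.cpp are the more realistic path on old laptops and phones.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# One plausible open checkpoint; swap in whatever local model you have weights for.
model_id = "mistralai/Mistral-7B-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # halves memory; quantization shrinks it further
    device_map="auto",          # spread across whatever GPU/CPU is available
)

inputs = tokenizer("The three biggest lessons of 2023 were", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```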

Alignment work has exploded. Fine-tuning techniques have bloomed across 2023 and hobbyists have sunk their teeth into this space. I think synthetic data has emerged as one of the most promising avenues for future work here.
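Much of that hobbyist activity is parameter-efficient fine-tuning rather than full training. A rough sketch, assuming the Hugging Face peft library, a small placeholder base model, and purely illustrative hyperparameters:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Any small causal LM works here; the name is a placeholder choice.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the base model

# From here the model drops into a standard training loop, usually on an
# instruction-style (prompt, response) dataset -- increasingly a synthetic
# one distilled from a stronger model.
```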

We still can't beat GPT-4. Despite many success stories from the open-source space, we still haven't figured out how to build a model as capable as GPT-4. GPT-3.5 has been trumped for a while now, but the alleged 8x MoE GPT-4 is still waiting to be slain.

Current benchmarks are outdated. Traditional benchmarks are proving inadequate, as they fail to capture the full spectrum of these models' capabilities. As a result, users are increasingly relying on their personal experiences, anecdotal evidence, and subjective evaluations like those from Chatbot Arena to assess the utility of different models. This is probably the most infuriating part about LLMs right now, because we know so little about why some things work and others don't. It's really hard to be deliberate in your prompting: you've got to get to know the model you're working with before it becomes useful; you've got to learn its quirks, what works and what doesn't, and how much you need to tip it...
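Chatbot Arena is a good example of this shift: rankings come from pairwise human votes aggregated into chess-style ratings rather than from a fixed test set. As a rough sketch, here's a plain Elo update; the Arena's actual aggregation has used Elo- and Bradley-Terry-style fits, and the K-factor here is arbitrary.

```python
def elo_update(rating_a, rating_b, winner, k=32):
    """One pairwise comparison between two models: winner is 'a', 'b', or 'tie'."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1 - score_a) - (1 - expected_a))
    return rating_a, rating_b

# Example: two models start at 1000; model A wins one human vote.
print(elo_update(1000, 1000, "a"))  # -> (1016.0, 984.0)
```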

LLMs are great, but also dumb. Even seasoned practitioners find themselves amazed at the occasional spark of ingenuity these models exhibit, yet they can also turn out to be remarkably dumb. We've established that, despite what it may look like, LLMs can't reason. They lack simple symmetric reasoning capabilities, often only understanding one side of a relationship and failing to infer the reverse: a model that has learned that Tom Cruise's mother is Mary Lee Pfeiffer may still fail when asked who Mary Lee Pfeiffer's son is. This issue is predominantly linked to the sequencing of information in their training data.

Giving LLMs time to think. The current design of LLMs does not account for the varying complexities of requests, treating all of them with uniform computational effort. This isn't representative of our own thinking: accuracy should improve when a model is given more time to process a request. There's a growing consensus that LLMs need a 'slower thinking' mechanism for complex problems, akin to the 'System 1' and 'System 2' processes described in 'Thinking, Fast and Slow'. LLMs only operate in System 1. They lack the ability to stop, plan and execute a line of thought. Tree-of-Thoughts and Chain-of-Thought prompting are early prototypes of this idea, but they're slow and computationally expensive. Do we really have to convert the internal representations to output tokens only to feed them back into the model all the way at the top? The models should be given the ability to think internally!
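Today's workaround is to spend that extra compute in token space: sample several chains of thought and aggregate them. A minimal self-consistency sketch over chain-of-thought samples is below; `generate` is a hypothetical placeholder for whatever sampling-capable model call you have, not a real API.

```python
from collections import Counter

COT_PROMPT = "Q: {question}\nLet's think step by step, then give the final answer after 'Answer:'."

def self_consistent_answer(question, generate, n_samples=5):
    """Sample several chains of thought and majority-vote the final answers.
    `generate(prompt, temperature)` is a placeholder for any LLM sampling call."""
    answers = []
    for _ in range(n_samples):
        completion = generate(COT_PROMPT.format(question=question), temperature=0.8)
        if "Answer:" in completion:
            answers.append(completion.split("Answer:")[-1].strip())
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]
```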

What we still don't know

  • How to reliably answer questions about long context formats, such as books.

  • How do we tell if a model is unsure of what it is saying?

  • Is tokenization the best way to represent text and images?

  • How (if possible) can we steer models to prevent hallucinations?

  • How can we make models think longer about certain inputs? (System 2 thinking.)

  • How far can smaller models go? (Interesting results coming from TinyLlama)

  • How can we build long term memory across interactions?

  • Is next-word prediction enough?

  • Can we watermark LM-generated text?

  • Are transformers the optimal architecture for LLMs? Attention's quadratic cost is still at large.
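On that last point, the quadratic cost falls straight out of the attention computation: every token attends to every other token, so the score matrix grows as the square of the sequence length. A bare-bones sketch (single head, no masking or batching, assuming PyTorch):

```python
import torch

def naive_attention(q, k, v):
    """Single-head scaled dot-product attention over a sequence of length n.
    The (n, n) score matrix is where the quadratic time and memory go."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # shape (n, n)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v                            # shape (n, d)

n, d = 4096, 64
q = k = v = torch.randn(n, d)
out = naive_attention(q, k, v)  # materializes a 4096 x 4096 score matrix
```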