Blog posts
- 28 Sep 2024: things i want to do
- 17 Sep 2024: what is o1; why is it a big deal?
- 29 Aug 2024: maximize gpu utilization
- 24 Jul 2024: the bitter transformer lesson
- 08 May 2024: a research engineer
- 23 Apr 2024: llama/phi-3, scaling laws, and the benchmarking conundrum
- 21 Mar 2024: making sense of floating points
- 17 Feb 2024: language models. world models.
- 06 Jan 2024: beyond chinchilla - embracing a world of inference
- 31 Dec 2023: what we've learnt in 2023. and what we haven't
- 17 Dec 2023: gpu: a technical orientation
- 31 Oct 2023: training in fp-8
- 20 Oct 2023: efficient training for the gpu-poor
- 19 Oct 2023: distributed training for the gpu-poor
- 11 Oct 2023: current thoughts on large language models
- 17 Sep 2023: inference, quantization, and the .cpp projects
Literature reviews
- 19 Sep 2024: Physics of Language Models
- 20 May 2024: LoRA Learns Less and Forgets Less
- 09 Apr 2024: Knowledge Capacity Scaling Laws
- 03 Mar 2024: Reading Roundup
- 23 Feb 2024: OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement
- 15 Feb 2024: Reading Roundup
- 15 Jan 2024: Direct Preference Optimization: Your Language Model is Secretly a Reward Model
- 04 Jan 2024: Reading Roundup
- 02 Jan 2024: LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning
- 12 Dec 2023: Mixture of Experts
- 06 Dec 2023: Gemini and AlphaCode 2
- 30 Nov 2023: Constitutional AI: Harmlessness from AI Feedback
- 17 Nov 2023: LoRA: Low-Rank Adaptation of Large Language Models
- 16 Nov 2023: Reading Roundup
- 15 Nov 2023: Rethinking Benchmark and Contamination for Language Models with Rephrased Samples
- 09 Nov 2023: Robust Speech Recognition via Large-Scale Weak Supervision
- 07 Nov 2023: Mistral 7B
- 30 Oct 2023: Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks
- 27 Oct 2023: ConvNets Match Vision Transformers at Scale
- 22 Oct 2023: VIMA: General Robot Manipulation with Multimodal Prompts
- 15 Oct 2023: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
- 09 Oct 2023: Mastering the Game of Stratego with Model-Free Multiagent Reinforcement Learning
- 08 Oct 2023: Communicative Agents for Software Development
- 05 Oct 2023: Reward is enough
- 29 Sep 2023: Superhuman AI for multiplayer poker
- 28 Sep 2023: Grandmaster level in StarCraft II using multi-agent reinforcement learning
- 27 Sep 2023: Dota 2 with Large Scale Deep Reinforcement Learning
- 26 Sep 2023: Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm
- 17 Sep 2023: Llama 2: Open Foundation and Fine-Tuned Chat Models
- 10 Sep 2023: Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality
- 10 Sep 2023: What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study
- 05 Sep 2023: Training language models to follow instructions with human feedback
- 31 Aug 2023: Evaluating Large Language Models Trained on Code
- 23 Aug 2023: Deep Double Descent: Where Bigger Models and More Data Hurt
- 23 Aug 2023: LLaMA: Open and Efficient Foundation Language Models
- 22 Aug 2023: PaLM: Scaling Language Modeling with Pathways
- 07 Aug 2023: Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
- 07 Aug 2023: TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
- 04 Aug 2023: Pre-Trained Large Language Models for Industrial Control
- 04 Aug 2023: An image is worth 16x16 words: Transformers for image recognition at scale
- 01 Aug 2023: Training Compute-Optimal Large Language Models
- 28 Jul 2023: Scaling Language Models: Methods, Analysis & Insights from Training Gopher
- 26 Jul 2023: Low-Resource Text Classification: A Parameter-Free Classification Method with Compressors
- 23 Jul 2023: Learning to summarize from human feedback
- 20 Jul 2023: Fine-Tuning Language Models from Human Preferences
- 10 Jul 2023: Policy Gradient Methods and Proximal Policy Optimization Algorithms
- 28 Jun 2023: Speeding up Transformers
- 25 Jun 2023: Language Models are Few-Shot Learners
- 16 Jun 2023: ALBERT: A Lite BERT For Self-Supervised Learning of Language Representations
- 09 Jun 2023: XLNet: Generalized Autoregressive Pretraining for Language Understanding
- 06 Jun 2023: RoBERTa: A Robustly Optimized BERT Pretraining Approach
- 03 Jun 2023: Universal Language Model Fine-tuning for Text Classification
- 25 May 2023: Scaling Laws for Neural Language Models
- 09 Jan 2023: MRI Super-Resolution with Ensemble Learning and Complementary Priors
- 06 Jan 2023: U-Net: Convolutional Networks for Biomedical Image Segmentation
- 05 Jan 2023: Cramming: Training A Language Model On A Single GPU In One Day
- 18 Dec 2022: Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
- 12 Dec 2022: Language Models are Unsupervised Multitask Learners
- 08 Dec 2022: Adam: A Method for Stochastic Optimization
- 06 Dec 2022: Generating Wikipedia by Summarizing Long Sequences
- 23 Nov 2022: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- 03 Nov 2022: How Does Batch Normalization Help Optimization?
- 28 Oct 2022: Improving Language Understanding by Generative Pre-Training
- 25 Oct 2022: An overview of gradient descent optimization algorithms
- 10 Oct 2022: Discovering faster matrix multiplication algorithms with reinforcement learning
- 03 Oct 2022: Attention is all you need
- 01 Oct 2022: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
- 29 Sep 2022: Efficient Estimation of Word Representations in Vector Space
- 01 Jul 2022: Human-level control through deep reinforcement learning
- 13 Jun 2022: RRT* - Sampling based Motion Planning
- 10 Jun 2022: ARA* - Anytime A* with Provable Bounds on Sub-Optimality
- 05 Jun 2022: Implementation of the Pure Pursuit Tracking Algorithm