Blog posts
- 28 Sep 2024: things i want to do
- 17 Sep 2024: what is o1; why is it a big deal?
- 29 Aug 2024: maximize gpu utilization
- 24 Jul 2024: the bitter transformer lesson
- 08 May 2024: a research engineer
- 23 Apr 2024: llama/phi-3, scaling laws, and the benchmarking conundrum
- 21 Mar 2024: making sense of floating points
- 17 Feb 2024: language models. world models.
- 06 Jan 2024: beyond chinchilla - embracing a world of inference
- 31 Dec 2023: what we've learnt in 2023. and what we haven't
- 17 Dec 2023: gpu: a technical orientation
- 31 Oct 2023: training in fp-8
- 20 Oct 2023: efficient training for the gpu-poor
- 19 Oct 2023: distributed training for the gpu-poor
- 11 Oct 2023: current thoughts on large language models
- 17 Sep 2023: inference, quantization, and the .cpp projects
Literature reviews
- 19 Sep 2024: Physics of Language Models
- 20 May 2024: LoRA Learns Less and Forgets Less
- 09 Apr 2024: Knowledge Capacity Scaling Laws
- 03 Mar 2024: Reading Roundup
- 23 Feb 2024: OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement
- 15 Feb 2024: Reading Roundup
- 15 Jan 2024: Direct Preference Optimization: Your Language Model is Secretly a Reward Model
- 04 Jan 2024: Reading Roundup
- 02 Jan 2024: LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning
- 12 Dec 2023: Mixture of Experts
- 06 Dec 2023: Gemini and AlphaCode 2
- 30 Nov 2023: Constitutional AI: Harmlessness from AI Feedback
- 17 Nov 2023: LoRA: Low-Rank Adaptation of Large Language Models
- 16 Nov 2023: Reading Roundup
- 15 Nov 2023: Rethinking Benchmark and Contamination for Language Models with Rephrased Samples
- 09 Nov 2023: Robust Speech Recognition via Large-Scale Weak Supervision
- 07 Nov 2023: Mistral 7B
- 30 Oct 2023: Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks
- 27 Oct 2023: ConvNets Match Vision Transformers at Scale
- 22 Oct 2023: VIMA: General Robot Manipulation with Multimodal Prompts
- 15 Oct 2023: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
- 09 Oct 2023: Mastering the Game of Stratego with Model-Free Multiagent Reinforcement Learning
- 08 Oct 2023: Communicative Agents for Software Development
- 05 Oct 2023: Reward is enough
- 29 Sep 2023: Superhuman AI for multiplayer poker
- 28 Sep 2023: Grandmaster level in StarCraft II using multi-agent reinforcement learning
- 27 Sep 2023: Dota 2 with Large Scale Deep Reinforcement Learning
- 26 Sep 2023: Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm
- 17 Sep 2023: Llama 2: Open Foundation and Fine-Tuned Chat Models
- 10 Sep 2023: Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality
- 10 Sep 2023: What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study
- 05 Sep 2023: Training language models to follow instructions with human feedback
- 31 Aug 2023: Evaluating Large Language Models Trained on Code
- 23 Aug 2023: Deep Double Descent: Where Bigger Models and More Data Hurt
- 23 Aug 2023: LLaMA: Open and Efficient Foundation Language Models
- 22 Aug 2023: PaLM: Scaling Language Modeling with Pathways
- 07 Aug 2023: Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
- 07 Aug 2023: TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
- 04 Aug 2023: Pre-Trained Large Language Models for Industrial Control
- 04 Aug 2023: An image is worth 16x16 words: Transformers for image recognition at scale
- 01 Aug 2023: Training Compute-Optimal Large Language Models
- 28 Jul 2023: Scaling Language Models: Methods, Analysis & Insights from Training Gopher
- 26 Jul 2023: Low-Resource Text Classification: A Parameter-Free Classification Method with Compressors
- 23 Jul 2023: Learning to summarize from human feedback
- 20 Jul 2023: Fine-Tuning Language Models from Human Preferences
- 10 Jul 2023: Policy Gradient Methods and Proximal Policy Optimization Algorithms
- 28 Jun 2023: Speeding up Transformers
- 25 Jun 2023: Language Models are Few-Shot Learners
- 16 Jun 2023: ALBERT: A Lite BERT For Self-Supervised Learning of Language Representations
- 09 Jun 2023: XLNet: Generalized Autoregressive Pretraining for Language Understanding
- 06 Jun 2023: RoBERTa: A Robustly Optimized BERT Pretraining Approach
- 03 Jun 2023: Universal Language Model Fine-tuning for Text Classification
- 25 May 2023: Scaling Laws for Neural Language Models
- 09 Jan 2023: MRI Super-Resolution with Ensemble Learning and Complementary Priors
- 06 Jan 2023: U-Net: Convolutional Networks for Biomedical Image Segmentation
- 05 Jan 2023: Cramming: Training A Language Model On A Single GPU In One Day
- 18 Dec 2022: Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
- 12 Dec 2022: Language Models are Unsupervised Multitask Learners
- 08 Dec 2022: Adam: A Method for Stochastic Optimization
- 06 Dec 2022: Generating Wikipedia by Summarizing Long Sequences
- 23 Nov 2022: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- 03 Nov 2022: How Does Batch Normalization Help Optimization?
- 28 Oct 2022: Improving Language Understanding by Generative Pre-Training
- 25 Oct 2022: An overview of gradient descent optimization algorithms
- 10 Oct 2022: Discovering faster matrix multiplication algorithms with reinforcement learning
- 03 Oct 2022: Attention is all you need
- 01 Oct 2022: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
- 29 Sep 2022: Efficient Estimation of Word Representations in Vector Space
- 01 Jul 2022: Human-level control through deep reinforcement learning
- 13 Jun 2022: RRT* - Sampling based Motion Planning
- 10 Jun 2022: ARA* - Anytime A* with Provable Bounds on Sub-Optimality
- 05 Jun 2022: Implementation of the Pure Pursuit Tracking Algorithm