Autoregressive (AR) Language Modeling

Tony Jesuthasan
3 min read · Jul 31, 2021


Autoregressive (AR) language modeling is one of the best-known and most widely used pretraining objectives in Natural Language Processing. Because its roots lie in time series modeling, its role in language modeling can be hard to grasp at first. This article aims to provide a simple explanation of AR modeling.

Regression Analysis

Regression analysis is a set of statistical methods used to estimate the relationship between a dependent variable and one or more independent variables. It is useful for assessing the strength of the relationship between the variables under consideration and for modeling the future relationship between them.

Linear Regression

Variations of this analysis include non-linear regression, linear regression, and multiple linear regression. Each variation has its own set of assumptions.
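As a minimal sketch of the simplest variation, linear regression with one independent variable, the snippet below fits a straight line by ordinary least squares using NumPy. The data values and variable names are made up for illustration and are not taken from any source.

```python
import numpy as np

# Hypothetical data: x is the independent variable, y the dependent variable.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0  # points lie exactly on a line with slope 2 and intercept 1

# Design matrix with a column of ones for the intercept term.
X = np.column_stack([np.ones_like(x), x])

# Ordinary least squares: find beta that minimises ||X @ beta - y||^2.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = beta
```

Adding more columns to `X` would turn the same call into a multiple linear regression.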

What is Autoregression?

Autoregression is a time series model that uses observations from prior time steps as input to a regression equation that predicts the value at the next time step. The term autoregression indicates that it is a regression of the variable against itself.

Two examples of data from autoregressive models with different parameters.

An autoregressive model of order p can be written as follows:

$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \varepsilon_t$$

where ε_t is white noise. This is like a multiple regression but with lagged values of y_t as predictors. We refer to this as an AR(p) model, an autoregressive model of order p.
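To make the "regression against itself" idea concrete, here is a rough sketch that simulates a series of the form y_t = c + φ_1 y_{t-1} + ... + φ_p y_{t-p} + ε_t and then recovers the coefficients by regressing each value on its own lagged values. The helper names `simulate_ar` and `fit_ar`, the coefficient values, and the choice of standard normal noise are all assumptions made for this illustration.

```python
import numpy as np

def simulate_ar(phi, c, n, rng, burn_in=200):
    """Simulate y_t = c + sum_i phi[i] * y_{t-1-i} + eps_t with standard normal eps_t."""
    p = len(phi)
    y = np.zeros(n + burn_in + p)
    eps = rng.standard_normal(n + burn_in + p)
    for t in range(p, len(y)):
        # y[t-p:t][::-1] is (y_{t-1}, ..., y_{t-p}), lined up with phi_1..phi_p.
        y[t] = c + np.dot(phi, y[t - p:t][::-1]) + eps[t]
    return y[-n:]  # drop the burn-in so the start-up zeros do not matter

def fit_ar(y, p):
    """Estimate (c, phi_1..phi_p) by least squares on lagged values."""
    rows = [y[t - p:t][::-1] for t in range(p, len(y))]
    X = np.column_stack([np.ones(len(rows)), np.array(rows)])
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coef[0], coef[1:]

rng = np.random.default_rng(0)
true_phi = np.array([0.6, -0.3])  # illustrative AR(2) coefficients
y = simulate_ar(true_phi, c=1.0, n=5000, rng=rng)
c_hat, phi_hat = fit_ar(y, p=2)
```

With a few thousand samples, `phi_hat` should land close to `true_phi`. Libraries such as statsmodels offer more complete estimators; plain least squares is used here only to keep the link to multiple regression visible.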

Autoregressive (AR) Language Modeling

The autoregressive model is a feed-forward model that predicts the next word from the set of words in a given context. In this model, the context is restricted to one of two directions, either forward or backward. This makes it effective in generative NLP tasks, which build up context in the forward direction.

However, it does have a limitation. The model can use only the forward context or only the backward context, never both at once, which limits its understanding of the context around the word being predicted.
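The one-directional behaviour can be illustrated with a toy sketch. Here a count-based bigram model stands in for the neural network, so the context is just the single previous word. The corpus and the function names `train_bigram`, `next_word_probs`, and `generate` are invented for this example.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each word follows each previous word (forward context only)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, prev):
    """Return P(next | prev) as a dict, estimated from the counts."""
    total = sum(counts[prev].values())
    return {w: c / total for w, c in counts[prev].items()}

def generate(counts, start, max_len):
    """Greedy left-to-right generation: each new word is fed back in as context."""
    out = [start]
    while len(out) < max_len and counts.get(out[-1]):
        probs = next_word_probs(counts, out[-1])
        out.append(max(probs, key=probs.get))
    return out

corpus = "the cat sat on the mat . the cat ate .".split()
model = train_bigram(corpus)
probs = next_word_probs(model, "the")   # distribution over words seen after "the"
sentence = generate(model, "the", max_len=4)
```

Note that the prediction after "the" never looks at the words that come later in the sentence, which is exactly the limitation described above.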

Feed-forward Autoregressive model animation by Google DeepMind’s WaveNet

An autoregressive parametric model, such as a neural network, is trained to model the joint probability distribution of a text corpus as either a forward product or a backward product, in which each token is conditioned on the words before it or after it, respectively.
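Written out, the two factorizations look roughly as follows. The notation is an assumption made for this sketch: x = (x_1, ..., x_T) is a token sequence, x_{<t} denotes the tokens before position t, and x_{>t} the tokens after it.

$$p(x) = \prod_{t=1}^{T} p(x_t \mid x_{<t}) \qquad \text{(forward)}$$

$$p(x) = \prod_{t=1}^{T} p(x_t \mid x_{>t}) \qquad \text{(backward)}$$

Both are applications of the chain rule of probability; they differ only in the order in which the tokens are conditioned on.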

Autoregressive (AR) Language Model Examples

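The famed OpenAI GPT-2 and GPT-3 are both stock AR models. XLNet, a language representation model, uses autoregression along with ideas from autoencoding while avoiding their limitations, thus enjoying the best of both worlds. By contrast, BERT (Bidirectional Encoder Representations from Transformers), RoBERTa, ALBERT, XLM, DistilBERT, and ELECTRA are usually classed as autoencoding models rather than autoregressive ones, since they are pretrained to reconstruct corrupted input using context from both directions.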

