May 2026
Words in Context
In 1957, J.R. Firth articulated a profound insight about how words derive meaning: “You shall know a word by the company it keeps.” This observation aligns closely with Zellig Harris's Distributional Hypothesis, which holds that words occurring in similar contexts tend to have similar meanings.
The Distributional Hypothesis later became a foundational principle of statistical NLP.
During the deep learning resurgence of the 2010s, this principle underpinned Word2Vec (Mikolov et al., 2013), which popularized two neural approaches to learning word embeddings: 1) predicting a center word given its surrounding context words, known as the continuous bag-of-words (CBOW) model, and 2) predicting the context words given a center word, known as the skip-gram model.
Mathematically, CBOW can be formulated as maximizing
\[
P(w_t \mid w_{t-k}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+k}),
\]
where \(w_t\) is the center word and \(w_{t-k}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+k}\) are the context words within a window of radius \(k\) around \(w_t\).
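To make the CBOW objective concrete, here is a minimal sketch in plain NumPy, not the reference Word2Vec implementation: the toy corpus, the window radius \(k\), the embedding dimension, and all variable names are illustrative assumptions. It slides a window over a sentence, averages the context embeddings, and scores \(P(w_t \mid \text{context})\) with a softmax over the vocabulary.

```python
# A minimal CBOW sketch in NumPy (illustrative; not the original Word2Vec code).
# The toy corpus, window size k, embedding dimension, and all names below
# are assumptions made for this example.
import numpy as np

corpus = "you shall know a word by the company it keeps".split()
vocab = sorted(set(corpus))
word_to_id = {w: i for i, w in enumerate(vocab)}

k = 2        # context window radius (assumed)
dim = 8      # embedding dimension (assumed)
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(len(vocab), dim))   # context-word embeddings
W_out = rng.normal(scale=0.1, size=(len(vocab), dim))  # center-word embeddings

def cbow_probs(context_ids):
    """P(w_t = . | context): softmax over scores of the averaged context vector."""
    h = W_in[context_ids].mean(axis=0)   # average the context embeddings
    scores = W_out @ h                   # one score per vocabulary word
    e = np.exp(scores - scores.max())    # numerically stable softmax
    return e / e.sum()

# Slide a window over the corpus: each position yields one (context, center) pair.
for t in range(k, len(corpus) - k):
    context = corpus[t - k:t] + corpus[t + 1:t + k + 1]
    p = cbow_probs([word_to_id[w] for w in context])
    print(f"P({corpus[t]!r} | {context}) = {p[word_to_id[corpus[t]]]:.4f}")
```

Training would then adjust \(W_{in}\) and \(W_{out}\) by gradient ascent on the probability assigned to the true center word at each position; the sketch only shows the forward scoring.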
Meanwhile, skip-gram can be formulated as maximizing
\[
\prod_{\substack{-k \le j \le k \\ j \ne 0}} P(w_{t+j} \mid w_t),
\]
where \(w_t\) is the center word and each surrounding context word \(w_{t+j}\) is the prediction target.
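The skip-gram direction can be sketched the same way. This companion snippet repeats the toy setup so it runs on its own, with the same caveat that the corpus, hyperparameters, and names are assumptions; here the softmax scores every candidate context word against the center word's embedding, and the per-position product mirrors the objective above.

```python
# A companion skip-gram sketch; setup repeated from the CBOW sketch above so
# this block runs standalone. All names and hyperparameters are assumptions.
import numpy as np

corpus = "you shall know a word by the company it keeps".split()
vocab = sorted(set(corpus))
word_to_id = {w: i for i, w in enumerate(vocab)}

k = 2        # context window radius (assumed)
dim = 8      # embedding dimension (assumed)
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(len(vocab), dim))   # center-word embeddings
W_out = rng.normal(scale=0.1, size=(len(vocab), dim))  # context-word embeddings

def skipgram_probs(center_id):
    """P(w_{t+j} = . | w_t): softmax over scores against the center embedding."""
    scores = W_out @ W_in[center_id]
    e = np.exp(scores - scores.max())
    return e / e.sum()

# The skip-gram objective multiplies P(w_{t+j} | w_t) over the window around w_t.
for t in range(k, len(corpus) - k):
    p = skipgram_probs(word_to_id[corpus[t]])
    window = [corpus[t + j] for j in range(-k, k + 1) if j != 0]
    likelihood = np.prod([p[word_to_id[w]] for w in window])
    print(f"P({window} | {corpus[t]!r}) = {likelihood:.6f}")
```

In practice, computing the full softmax over a large vocabulary is expensive, which is why Mikolov et al. (2013) trained these models with approximations such as hierarchical softmax and negative sampling.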