Conjugate Priors for Bernoulli and Multinomial Models
Understanding how to update beliefs about model parameters as new data arrives is crucial in machine learning. This is where the concept of conjugate priors becomes powerful, especially for the Bernoulli and Multinomial distributions. For binary outcome models (Bernoulli), the Beta distribution serves as a conjugate prior, while for categorical models (Multinomial), the Dirichlet distribution plays this role. Using these conjugate priors allows you to update your uncertainty about probabilities in a mathematically convenient way as you observe more data, making Bayesian inference both tractable and intuitive.
- The Beta distribution is a probability distribution over values between 0 and 1, parameterized by two positive values, often denoted α (alpha) and β (beta). When used as a prior for the Bernoulli parameter (the probability of success), it expresses prior beliefs about which values of that parameter are plausible.
- The Dirichlet distribution generalizes the Beta distribution to multiple categories. It is parameterized by a vector of positive values, one per category, and serves as a prior for the probability vector of a Multinomial distribution, expressing prior beliefs about how likely each category is; see the code sketch after this list.
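To make these two priors concrete, here is a minimal sketch using SciPy's `beta` and `dirichlet` distributions. The specific parameter values (a Beta(2, 2) prior and a symmetric Dirichlet with concentration 1 per category) are illustrative choices, not something prescribed by the lesson.

```python
import numpy as np
from scipy import stats

# Beta prior over a Bernoulli success probability.
# Beta(2, 2) is an illustrative, mildly informative prior centered at 0.5.
beta_prior = stats.beta(a=2, b=2)
print("Beta prior mean:", beta_prior.mean())       # 0.5
print("Prior density at p=0.7:", beta_prior.pdf(0.7))

# Dirichlet prior over a 3-category Multinomial probability vector.
# A symmetric concentration of 1 per category is a common "flat" choice.
dirichlet_prior = stats.dirichlet(alpha=[1.0, 1.0, 1.0])
print("Dirichlet prior mean:", dirichlet_prior.mean())   # [1/3, 1/3, 1/3]
print("One sampled probability vector:", dirichlet_prior.rvs(size=1))
```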
The main advantage of using conjugate priors like the Beta for Bernoulli models and the Dirichlet for Multinomial models is the mathematical simplicity they provide for parameter updating. When a conjugate prior is combined with its corresponding likelihood, the resulting posterior distribution belongs to the same family as the prior. As a result, observing new data reduces to adjusting the prior's parameters: a Beta(α, β) prior combined with s observed successes and f failures yields a Beta(α + s, β + f) posterior, and a Dirichlet prior is updated by adding the observed category counts to its concentration parameters. In machine learning, this makes Bayesian updating efficient and scalable, especially when learning iteratively from streaming or batch data, and this simple updating rule is one reason why conjugate priors remain central to practical Bayesian methods.
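Below is a small, self-contained sketch of that count-adding rule for both models. The helper names `update_beta` and `update_dirichlet`, along with the example data, are hypothetical choices made for illustration.

```python
import numpy as np

def update_beta(alpha, beta, data):
    """Beta-Bernoulli conjugate update: add successes to alpha, failures to beta."""
    data = np.asarray(data)
    successes = int(data.sum())
    failures = int(len(data) - successes)
    return alpha + successes, beta + failures

def update_dirichlet(alpha, counts):
    """Dirichlet-Multinomial conjugate update: add observed category counts to alpha."""
    return np.asarray(alpha, dtype=float) + np.asarray(counts, dtype=float)

# Beta-Bernoulli: start from Beta(2, 2), observe 7 successes and 3 failures.
coin_flips = [1, 1, 1, 0, 1, 1, 0, 1, 1, 0]
a_post, b_post = update_beta(2, 2, coin_flips)
print(f"Posterior is Beta({a_post}, {b_post}), mean {a_post / (a_post + b_post):.3f}")

# Dirichlet-Multinomial: start from a symmetric Dirichlet(1, 1, 1), observe category counts.
category_counts = [5, 2, 3]
alpha_post = update_dirichlet([1, 1, 1], category_counts)
print(f"Posterior is Dirichlet({alpha_post}), mean {alpha_post / alpha_post.sum()}")
```

Because the posterior stays in the prior's family, the same update can be applied repeatedly as new batches arrive, with each posterior serving as the prior for the next batch.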