Deriving the Bellman equation for value and Q functions
Now let us see how to derive Bellman equations for value and Q functions.
You can skip this section if you are not interested in mathematics; however, the math will be super intriguing.
First, we define $P_{ss'}^{a}$ as the transition probability of moving from state $s$ to state $s'$ while performing an action $a$:
$$P_{ss'}^{a} = Pr\left(s_{t+1} = s' \mid s_t = s, a_t = a\right) \tag{1}$$
We define $R_{ss'}^{a}$ as the expected reward received by moving from state $s$ to state $s'$ while performing an action $a$:
$$R_{ss'}^{a} = \mathbb{E}\left[r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\right] \tag{2}$$
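To make these two quantities concrete, here is a minimal sketch, assuming a hypothetical two-state, two-action MDP with made-up numbers, of how $P_{ss'}^{a}$ and $R_{ss'}^{a}$ can be stored as NumPy arrays:

```
import numpy as np

# Hypothetical MDP with 2 states and 2 actions; all numbers are made up for illustration.
# P[a, s, s_next] = probability of landing in s_next when taking action a in state s.
P = np.array([
    [[0.8, 0.2],   # action 0 taken in state 0
     [0.1, 0.9]],  # action 0 taken in state 1
    [[0.5, 0.5],   # action 1 taken in state 0
     [0.3, 0.7]],  # action 1 taken in state 1
])

# R[a, s, s_next] = expected immediate reward for the transition (s, a) -> s_next.
R = np.array([
    [[ 1.0, 0.0],
     [ 0.0, 2.0]],
    [[ 0.5, 0.5],
     [-1.0, 1.0]],
])

# Every P[a, s, :] is a probability distribution over next states, so it sums to 1.
assert np.allclose(P.sum(axis=-1), 1.0)
```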
We know that the value function can be represented as:
$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[R_t \mid s_t = s\right] \tag{3}$$
$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty}\gamma^{k} r_{t+k+1} \,\Big|\, s_t = s\right] \tag{4}$$
We can rewrite our value function by taking the first reward out:
$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[r_{t+1} + \gamma\sum_{k=0}^{\infty}\gamma^{k} r_{t+k+2} \,\Big|\, s_t = s\right] \tag{5}$$
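As a quick sanity check on this step, the following sketch, using an arbitrary finite reward sequence and a discount factor chosen only for illustration, verifies numerically that the return equals the first reward plus $\gamma$ times the return starting one step later:

```
# Check: sum_k gamma^k * r_{t+k+1} == r_{t+1} + gamma * sum_k gamma^k * r_{t+k+2}
gamma = 0.9
rewards = [1.0, 0.0, 2.0, -1.0, 3.0]  # arbitrary rewards r_{t+1}, r_{t+2}, ...

full_return = sum(gamma**k * r for k, r in enumerate(rewards))
tail_return = sum(gamma**k * r for k, r in enumerate(rewards[1:]))

assert abs(full_return - (rewards[0] + gamma * tail_return)) < 1e-12
```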
The expectation in the value function specifies the expected return if we are in the state s and perform actions according to the policy π.
So, we can rewrite our expectation explicitly by summing over all possible actions $a$ and next states $s'$ as follows:
$$\mathbb{E}_{\pi}\left[r_{t+1} + \gamma\sum_{k=0}^{\infty}\gamma^{k} r_{t+k+2} \,\Big|\, s_t = s\right] = \sum_{a}\pi(s,a)\sum_{s'}P_{ss'}^{a}\,\mathbb{E}_{\pi}\left[r_{t+1} + \gamma\sum_{k=0}^{\infty}\gamma^{k} r_{t+k+2} \,\Big|\, s_t = s, a_t = a, s_{t+1} = s'\right] \tag{6}$$
In the second term of the RHS, the rewards from time t+2 onwards depend only on the next state s' because of the Markov property, so the conditional expectation simplifies as follows:
$$\mathbb{E}_{\pi}\left[\gamma\sum_{k=0}^{\infty}\gamma^{k} r_{t+k+2} \,\Big|\, s_t = s, a_t = a, s_{t+1} = s'\right] = \gamma\,\mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty}\gamma^{k} r_{t+k+2} \,\Big|\, s_{t+1} = s'\right]$$
Similarly, in the first term of the RHS, we will substitute the expected value of $r_{t+1}$ from equation (2) as follows:
$$\mathbb{E}\left[r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\right] = R_{ss'}^{a}$$
So, our final expectation equation becomes:
$$\mathbb{E}_{\pi}\left[r_{t+1} + \gamma\sum_{k=0}^{\infty}\gamma^{k} r_{t+k+2} \,\Big|\, s_t = s\right] = \sum_{a}\pi(s,a)\sum_{s'}P_{ss'}^{a}\left[R_{ss'}^{a} + \gamma\,\mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty}\gamma^{k} r_{t+k+2} \,\Big|\, s_{t+1} = s'\right]\right] \tag{7}$$
Now we will substitute our expectation (7) into the value function (5) as follows:
$$V^{\pi}(s) = \sum_{a}\pi(s,a)\sum_{s'}P_{ss'}^{a}\left[R_{ss'}^{a} + \gamma\,\mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty}\gamma^{k} r_{t+k+2} \,\Big|\, s_{t+1} = s'\right]\right]$$
Instead of $\mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty}\gamma^{k} r_{t+k+2} \,\big|\, s_{t+1} = s'\right]$, we can substitute $V^{\pi}(s')$ using equation (4), since this expectation is exactly the value function of the next state $s'$, and our final value function looks like the following:
$$V^{\pi}(s) = \sum_{a}\pi(s,a)\sum_{s'}P_{ss'}^{a}\left[R_{ss'}^{a} + \gamma V^{\pi}(s')\right]$$
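To see this equation in action, here is a minimal policy-evaluation sketch. It assumes the same hypothetical two-state, two-action MDP as the earlier sketch and an arbitrary fixed policy `pi`; it simply sweeps the Bellman equation until $V^{\pi}$ stops changing:

```
import numpy as np

gamma = 0.9
n_states, n_actions = 2, 2

# The same made-up MDP as in the earlier sketch: P[a, s, s'] and R[a, s, s'].
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [-1.0, 1.0]]])

# pi[s, a] = probability of choosing action a in state s (an arbitrary fixed policy).
pi = np.array([[0.6, 0.4],
               [0.3, 0.7]])

# Iterative policy evaluation: repeatedly apply
# V(s) = sum_a pi(s, a) * sum_s' P[a, s, s'] * (R[a, s, s'] + gamma * V(s'))
V = np.zeros(n_states)
for _ in range(1000):
    V_new = np.array([
        sum(pi[s, a] * P[a, s, s2] * (R[a, s, s2] + gamma * V[s2])
            for a in range(n_actions) for s2 in range(n_states))
        for s in range(n_states)
    ])
    converged = np.max(np.abs(V_new - V)) < 1e-10
    V = V_new
    if converged:
        break

print(V)  # approximate V^pi(s) for each state of the hypothetical MDP
```

Each sweep applies the right-hand side of the Bellman equation once; because $\gamma < 1$, the updates converge to the unique fixed point, which is $V^{\pi}$.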
In very similar fashion, we can derive a Bellman equation for the Q function; the final equation is as follows:
$$Q^{\pi}(s,a) = \sum_{s'}P_{ss'}^{a}\left[R_{ss'}^{a} + \gamma\sum_{a'}\pi(s',a')\,Q^{\pi}(s',a')\right]$$
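Continuing directly from the policy-evaluation sketch above (it assumes the `P`, `R`, `gamma`, and converged `V` arrays defined there), $Q^{\pi}$ then follows from a single application of this equation, using the fact that $\sum_{a'}\pi(s',a')\,Q^{\pi}(s',a')$ is exactly $V^{\pi}(s')$:

```
# Continues the previous sketch: P, R, gamma, n_states, n_actions, and V are assumed.
# Q(s, a) = sum_s' P[a, s, s'] * (R[a, s, s'] + gamma * V(s')),
# where V(s') stands in for sum_a' pi(s', a') * Q(s', a').
Q = np.array([
    [sum(P[a, s, s2] * (R[a, s, s2] + gamma * V[s2]) for s2 in range(n_states))
     for a in range(n_actions)]
    for s in range(n_states)
])

print(Q)  # Q[s, a] approximates Q^pi(s, a) for the hypothetical MDP
```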
Now that we have a Bellman equation for both the value and Q function, we will see how to find the optimal policies.