<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Contextual Bandits and Reinforcement Learning</title>
	<atom:link href="http://pavel.surmenok.com/2017/08/26/contextual-bandits-and-reinforcement-learning/feed/" rel="self" type="application/rss+xml" />
	<link>http://pavel.surmenok.com/2017/08/26/contextual-bandits-and-reinforcement-learning/</link>
	<description></description>
	<lastBuildDate>Sat, 13 Apr 2019 17:14:00 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
	<item>
		<title>By: Parth Radia</title>
		<link>http://pavel.surmenok.com/2017/08/26/contextual-bandits-and-reinforcement-learning/#comment-143</link>
		<dc:creator>Parth Radia</dc:creator>
		<pubDate>Thu, 16 Nov 2017 16:44:00 +0000</pubDate>
		<guid isPermaLink="false">http://pavel.surmenok.com/?p=271#comment-143</guid>
		<description><![CDATA[Ah, understood. I guess my next question is -- if one has no reliable way to build a &quot;gym&quot; (simulation) for reinforcement learning, are contextual bandits the best initial approach?

The reason I am asking is that I have a domain where:
- There is no initial data (hence the bandits instead of collab. filt.)
- There is a way to collect context (hence the contextual bandits)
- There is no way to build simulations necessary for reinforcement learning.]]></description>
		<content:encoded><![CDATA[<p>Ah, understood. I guess my next question is &#8212; if one has no reliable way to build a &#8220;gym&#8221; (simulation) for reinforcement learning, are contextual bandits the best initial approach?</p>
<p>The reason I am asking is that I have a domain where:<br />
- There is no initial data (hence the bandits instead of collab. filt.)<br />
- There is a way to collect context (hence the contextual bandits)<br />
- There is no way to build simulations necessary for reinforcement learning.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: surmenok</title>
		<link>http://pavel.surmenok.com/2017/08/26/contextual-bandits-and-reinforcement-learning/#comment-142</link>
		<dc:creator>surmenok</dc:creator>
		<pubDate>Thu, 16 Nov 2017 04:59:00 +0000</pubDate>
		<guid isPermaLink="false">http://pavel.surmenok.com/?p=271#comment-142</guid>
		<description><![CDATA[Reinforcement learning models learn how to perform multiple actions. For example, in the game of chess, there can be a lot of moves before the outcome (win/draw/defeat) is observed.
Contextual bandits are a subset of reinforcement learning algorithms that are simpler: there is only one step before the outcome is observed. For example, you make one decision to select which link to show on a web page, and you get an outcome (and associated reward) after that: whether the user clicked on the link. In this sense, a contextual bandit is just a reinforcement learning algorithm reduced to one step.]]></description>
		<content:encoded><![CDATA[<p>Reinforcement learning models learn how to perform multiple actions. For example, in the game of chess, there can be a lot of moves before the outcome (win/draw/defeat) is observed.<br />
Contextual bandits are a subset of reinforcement learning algorithms that are simpler: there is only one step before the outcome is observed. For example, you make one decision to select which link to show on a web page, and you get an outcome (and associated reward) after that: whether the user clicked on the link. In this sense, a contextual bandit is just a reinforcement learning algorithm reduced to one step.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Parth Radia</title>
		<link>http://pavel.surmenok.com/2017/08/26/contextual-bandits-and-reinforcement-learning/#comment-141</link>
		<dc:creator>Parth Radia</dc:creator>
		<pubDate>Thu, 16 Nov 2017 03:12:00 +0000</pubDate>
		<guid isPermaLink="false">http://pavel.surmenok.com/?p=271#comment-141</guid>
		<description><![CDATA[Pavel, thanks for this. Bandit methods are pretty obscure and hard to learn about compared with collaborative filtering techniques.

The piece I&#039;m not understanding is the delineation between contextual bandits and reinforcement learning.

Specifically: &quot;If you get reinforcement learning algorithm with policy gradients and simplify it to a contextual bandit by reducing a number of steps to one, the model will be very similar to a supervised classification model.&quot;

 Could you expand a bit more on this?]]></description>
		<content:encoded><![CDATA[<p>Pavel, thanks for this. Bandit methods are pretty obscure and hard to learn about compared with collaborative filtering techniques.</p>
<p>The piece I&#8217;m not understanding is the delineation between contextual bandits and reinforcement learning.</p>
<p>Specifically: &#8220;If you get reinforcement learning algorithm with policy gradients and simplify it to a contextual bandit by reducing a number of steps to one, the model will be very similar to a supervised classification model.&#8221;</p>
<p> Could you expand a bit more on this?</p>
]]></content:encoded>
	</item>
</channel>
</rss>
