Join us to learn about reinforcement learning, the theory behind contextual bandits and how this applies to content personalization.
Personalizer is an award-winning AI service aimed at democratizing real-world reinforcement learning for content personalization. Its goal is to make reinforcement learning accessible to everyone, not just machine learning experts.
Personalizer is the result of a successful partnership between Microsoft Research and Azure Cognitive Services aimed at rapid technology transfer and innovation.
Agenda:
Intro - How can apps react to a changing world?
Introduction to reinforcement learning and contextual bandits
Overview of Azure Cognitive Services Personalizer
Where can you get help with Vowpal Wabbit and Personalizer
Q&A
Q: How is the presentation material linked to the Bonsai platform from MSFT?
Bonsai focuses on simulation-based RL, whereas the presentation today focuses on RL in the real world without access to a simulation. A key difference is the amount of data available to train algorithms, and how effective they are with that amount of data.
For example, user preferences cannot be simulated.
Q: I wonder how it does against a standard recommendation model and not just random choice. Do we have some comparisons for that?
Q: Is giving feedback or acting also a form of labelling?
Q: I.e., if I have an article, would I have to define a positive reward or a negative reward?
10:26:47 From Jack Gerrits : Yes, reward is similar to a label. Generally the agent maximizes rewards, so it would be positive.
Typically the reward value would be bounded between 0 and 1.
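As an illustrative sketch (not from the session) of what a reward bounded in [0, 1] might look like, using a hypothetical click-plus-dwell-time signal:

```python
# Hypothetical sketch: mapping implicit feedback to a reward in [0, 1].
def compute_reward(clicked: bool, seconds_read: float, max_seconds: float = 60.0) -> float:
    """Full reward for a click; partial credit for dwell time, capped at 0.5."""
    if clicked:
        return 1.0
    return 0.5 * min(seconds_read / max_seconds, 1.0)
```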
Q: What is the name of the reinforcement learning algorithm?
10:31:56 From Jack Gerrits : I believe Rajan is doing contextual bandits with action-dependent features - the default version in VW uses MTR as the CB type
Q: What does MTR stand for?
VW is built on a reductions-based architecture, so internally it uses cost-sensitive classification and stochastic gradient descent.
MTR stands for multitask regression.
Okay, my bad - this is not with action-dependent features - it's straight CB :)
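For reference, selecting the CB estimator when driving VW from Python might look like this (a sketch assuming the `vowpalwabbit` 9.x package; `mtr` is the default `--cb_type`):

```python
from vowpalwabbit import Workspace

# Plain contextual bandit over 3 actions, trained with multitask regression.
vw = Workspace("--cb 3 --cb_type mtr --quiet")
# Other reduction-based estimators VW supports: --cb_type ips, dm, dr
```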
Q: Why is it 86%? Should it not be 80/10/10 between the 3 actions?
10:40:20 From Jack Gerrits : The 20% is distributed between all 3 actions, including the top one, so the top action ends up with 80% plus a third of the 20%, i.e. roughly 86.7%.
Technically you can explore and choose the top action
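To make the arithmetic behind the 86% concrete, a quick worked example in plain Python:

```python
# Epsilon-greedy with epsilon = 0.2 over 3 actions: the exploration mass
# is split across ALL actions, so the greedy action keeps its share too.
epsilon, n_actions = 0.2, 3
p_top = (1 - epsilon) + epsilon / n_actions  # 0.8 + 0.0667 ≈ 0.8667
p_other = epsilon / n_actions                # ≈ 0.0667 each
print(f"{p_top:.3f}, {p_other:.3f}, {p_other:.3f}")  # 0.867, 0.067, 0.067
```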
Q: How is this loss calculated?
The cost function would be scenario-specific - for example, a click on an article.
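As a hedged sketch of how such a scenario-specific signal could feed VW's contextual-bandit label format ("action:cost:probability"; the helper below is hypothetical):

```python
def cb_label(action: int, clicked: bool, prob: float) -> str:
    """Build a VW CB label; VW minimizes cost, so a click means cost 0."""
    cost = 0.0 if clicked else 1.0
    return f"{action}:{cost}:{prob}"

print(cb_label(2, clicked=True, prob=0.1))  # -> "2:0.0:0.1"
```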
Q: Do you have a script which works with confidence intervals?
Q: Pardon my noobness, but is this in Python? VW (which is C, right?)?
11:13:05 From Jack Gerrits : VW has bindings to many languages; lots of people like to use it in Python.
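For instance, a minimal contextual-bandit round trip through the Python bindings might look like this (a sketch assuming the `vowpalwabbit` 9.x package; the feature names are made up):

```python
from vowpalwabbit import Workspace

# Contextual bandit over 3 actions with epsilon-greedy exploration.
vw = Workspace("--cb_explore 3 --epsilon 0.2 --quiet")

# Learn from one logged interaction: action 1 was shown, cost 0 (a click),
# logged at probability 0.8; context features follow the bar.
vw.learn("1:0:0.8 | likes_sports time=morning")

# Predict a probability distribution over the 3 actions for a new context.
probs = vw.predict("| likes_sports time=morning")
print(probs)  # e.g. [0.867, 0.067, 0.067]
```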
Q: Can you please share the Jupyter notebook with us too? Thanks!
11:15:06 From Jack Gerrits : This example uses Personalizer, so it requires an instance of the backing service.
11:16:12 From Jack Gerrits : Here is the Python tutorial for Personalizer: [ Link ]
You can find samples and quickstarts here: [ Link ]
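As a taste of what the linked quickstart covers, a rank-and-reward call with the Azure SDK looks roughly like this (a sketch assuming the `azure-cognitiveservices-personalizer` package; the endpoint, key, and action/feature values are placeholders):

```python
from azure.cognitiveservices.personalizer import PersonalizerClient
from azure.cognitiveservices.personalizer.models import RankableAction, RankRequest
from msrest.authentication import CognitiveServicesCredentials

# Placeholder credentials - use your own resource's endpoint and key.
client = PersonalizerClient(
    "https://<your-resource>.cognitiveservices.azure.com/",
    CognitiveServicesCredentials("<your-key>"),
)

actions = [
    RankableAction(id="article-sports", features=[{"topic": "sports"}]),
    RankableAction(id="article-news", features=[{"topic": "news"}]),
]
context = [{"timeOfDay": "morning"}, {"device": "mobile"}]

# Ask the service to rank the actions for this context...
response = client.rank(rank_request=RankRequest(actions=actions, context_features=context))

# ...then report the observed reward (e.g. 1.0 for a click) for that event.
client.events.reward(event_id=response.event_id, value=1.0)
```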
Q: This is such a great session. Exactly what we needed to nudge us forward. I am learning Python now.
Q: Is "apprentice" off-policy RL?
11:28:07 From Jack Gerrits : Apprentice mode allows you to begin logging training data before you actually start using the predictions from the model. There is a different feature in the service that handles counterfactual evaluation for policy optimization.
Q: Thank you for a great webinar! Very good information and resources were shared here. Development can only get better.
[ Link ]
[ Link ] would allow you to embed the client side of RL inside most environments.