Dialog policy optimization for low resource setting using Self-play and Reward based Sampling.
Tharindu Madusanka, Durashi Langappuli, Thisara Welmilla, Uthayasanker Thayasivam, Sanath Jayasena
Published in: PACLIC (2020)
Keyphrases
- multi armed bandit
- optimization algorithm
- average reward
- natural language
- allocation policies
- agent receives
- reinforcement learning
- optimization process
- resource management
- combinatorial optimization
- random sampling
- global optimization
- partially observable environments
- optimal policy
- constrained optimization
- action selection
- reward function
- finite horizon
- conversational agents
- sample size
- optimal resource allocation
- optimization problems