This is a post I will write when I have 48 hours free to crystallize all of these thoughts. Briefly, one thing I have been thinking about recently is the prototypical problem of optimizing one’s time—exploring new things to do versus doubling down on things done well.
- It’s easy to see this sampling process as a multi-armed bandit problem.
- However, I think most MAB models miss something key: life has superlinear returns - see https://www.paulgraham.com/superlinear.html.
- If life has superlinear returns, then we should use a nonstationary formulation in which each arm’s reward grows as some stochastic exponential function over time, and the bandit doesn’t actually know which activities are duds and which are superlinear (or perhaps has very weak, or even wrong, priors about which are truly superlinear). A toy simulation of this setup is sketched after the list below.
- I’m curious:
- Can classic RL and bandit algorithms be repurposed for this setting?
- How will different encoded priors about "which activities are good" affect the agent?
- How can we encode how aware an agent is? If one agent recognizes exponential growth sooner, we would expect its long-run gains to be higher because it can pivot earlier.
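To make the setup concrete, here is a minimal sketch in Python of the kind of environment I have in mind: a few arms whose mean reward compounds with the effort invested in them while the rest stay flat, plus a simple epsilon-greedy agent seeded with different priors. Everything here is a made-up placeholder (the arm indices, growth rate, noise level, and prior values), and tying growth to pull count rather than wall-clock time is just one assumption; it is a sketch, not a serious model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy nonstationary bandit: most arms are "duds" with flat noisy rewards,
# while a few are "superlinear" and grow roughly exponentially with the
# number of times they have been pulled. All constants are illustrative.
N_ARMS = 10
SUPERLINEAR = {2, 7}          # hypothetical: the agent does not know these
GROWTH = 0.15                 # assumed per-pull exponential growth rate

def pull(arm: int, n_pulls: int) -> float:
    """Reward for pulling `arm` after it has already been pulled `n_pulls` times."""
    if arm in SUPERLINEAR:
        mean = np.exp(GROWTH * n_pulls)   # compounding returns to invested effort
    else:
        mean = 1.0                        # flat "dud" activity
    return mean + rng.normal(scale=0.5)

def run_epsilon_greedy(horizon: int, prior: np.ndarray, eps: float = 0.1) -> float:
    """Epsilon-greedy agent whose value estimates start at `prior`.

    "Awareness" could be modeled crudely by how accurate the prior is, or by
    weighting recent rewards more heavily (not shown here).
    """
    est = prior.astype(float).copy()      # running value estimates per arm
    counts = np.zeros(N_ARMS, dtype=int)
    total = 0.0
    for _ in range(horizon):
        if rng.random() < eps:
            arm = int(rng.integers(N_ARMS))
        else:
            arm = int(np.argmax(est))
        r = pull(arm, counts[arm])
        counts[arm] += 1
        total += r
        # Incremental sample mean; note how badly this lags an exponential arm.
        est[arm] += (r - est[arm]) / counts[arm]
    return total

flat_prior = np.ones(N_ARMS)                           # no idea which arms are good
wrong_prior = np.ones(N_ARMS); wrong_prior[0] = 5.0    # confidently wrong
lucky_prior = np.ones(N_ARMS); lucky_prior[2] = 5.0    # happens to be right

for name, p in [("flat", flat_prior), ("wrong", wrong_prior), ("lucky", lucky_prior)]:
    print(name, round(run_epsilon_greedy(500, p), 1))
```

Even in this toy, the 1/n sample-mean update underreacts to arms whose rewards are still growing, which is one crude way to see why "awareness" (how quickly an agent updates on evidence of compounding growth) should dominate long-run outcomes.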