Paper Title
A Modular Framework for Reinforcement Learning Optimal Execution
Paper Authors
Paper Abstract
In this article, we develop a modular framework for the application of Reinforcement Learning to the problem of Optimal Trade Execution. The framework is designed with flexibility in mind, in order to ease the implementation of different simulation setups. Rather than focusing on agents and optimization methods, we focus on the environment and break down the requirements necessary to simulate Optimal Trade Execution under a Reinforcement Learning framework, such as data pre-processing, construction of observations, action processing, child order execution, simulation of benchmarks, and reward calculation. We give examples of each component, explore the difficulties that their individual implementations \& the interactions between them entail, and discuss the different phenomena that each component induces in the simulation, highlighting the divergences between the simulation and the behavior of a real market. We showcase our modular implementation through a setup that, following a Time-Weighted Average Price (TWAP) order submission schedule, allows the agent to exclusively place limit orders, simulates their execution by iterating over snapshots of the Limit Order Book (LOB), and calculates rewards as the \$ improvement over the price achieved by a TWAP benchmark algorithm following the same schedule. We also develop evaluation procedures that incorporate iterative re-training and evaluation of a given agent over intervals of a training horizon, mimicking how an agent may behave when continuously re-trained as new market data becomes available and emulating the monitoring practices that algorithm providers are bound to perform under current regulatory frameworks.
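As a rough illustration of the reward described in the abstract, the sketch below computes a reward as the dollar improvement of the agent's volume-weighted fill price over that of a TWAP benchmark following the same schedule. This is not the authors' code: the `Fill` dataclass, the helper names, and the scaling by executed quantity are assumptions made for the example.

```python
# Minimal sketch (assumed names, not the paper's implementation) of a reward
# computed as the $ improvement of the agent's fills over a TWAP benchmark.
from dataclasses import dataclass
from typing import List


@dataclass
class Fill:
    quantity: float  # shares filled by a child order
    price: float     # price at which the child order was filled


def vwap(fills: List[Fill]) -> float:
    """Volume-weighted average price of a list of fills."""
    total_qty = sum(f.quantity for f in fills)
    if total_qty == 0:
        raise ValueError("no quantity filled")
    return sum(f.quantity * f.price for f in fills) / total_qty


def dollar_improvement_reward(agent_fills: List[Fill],
                              benchmark_fills: List[Fill],
                              side: str = "sell") -> float:
    """Reward = $ improvement of the agent's average price over the TWAP
    benchmark, scaled here (an assumption) by the executed quantity.
    For a sell parent order a higher price is better; for a buy, lower."""
    qty = sum(f.quantity for f in agent_fills)
    price_diff = vwap(agent_fills) - vwap(benchmark_fills)
    return qty * (price_diff if side == "sell" else -price_diff)


# Hypothetical usage over one TWAP interval:
agent_fills = [Fill(100, 10.02), Fill(100, 10.01)]   # agent's limit-order fills
benchmark_fills = [Fill(200, 10.00)]                 # TWAP child order fill
print(dollar_improvement_reward(agent_fills, benchmark_fills))  # > 0: agent beat TWAP
```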
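Similarly, a minimal walk-forward sketch of the iterative re-training and evaluation procedure, assuming hypothetical `make_env`, `train_agent`, and `evaluate_agent` callables: the agent is re-trained on each interval of the horizon and evaluated on the following one, approximating continuous re-training as new market data becomes available.

```python
# Walk-forward re-training / evaluation sketch (assumed interface, not the
# authors' implementation): train on one interval, evaluate on the next.
from typing import Callable, List, Sequence, Tuple


def walk_forward_evaluation(
    intervals: Sequence[Tuple[str, str]],               # e.g. [("2020-01", "2020-02"), ...]
    make_env: Callable[[str, str], object],             # builds an execution env for a date range
    train_agent: Callable[[object, object], object],    # (agent, env) -> re-trained agent
    evaluate_agent: Callable[[object, object], float],  # (agent, env) -> avg reward vs. benchmark
    agent: object,
) -> List[float]:
    """Re-train the agent on each interval, then score it out-of-sample on the next."""
    scores = []
    for train_range, test_range in zip(intervals[:-1], intervals[1:]):
        train_env = make_env(*train_range)
        test_env = make_env(*test_range)
        agent = train_agent(agent, train_env)            # continue training with the newest data
        scores.append(evaluate_agent(agent, test_env))   # performance on the following interval
    return scores
```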