Application of Thompson Sampling to Online Network Revenue Management

Keywords: revenue management, pricing, multi-armed bandit

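"This research paper studies how Thompson sampling can be used in network revenue management to set dynamic prices that maximize a retailer's revenue over a finite selling season. With limited inventory and unknown demand-function parameters, the retailer must balance the exploration-exploitation tradeoff: early on it tries different prices to learn demand (exploration), and then it adjusts prices based on what it has learned to maximize revenue over the rest of the selling season (exploitation). The paper proposes a dynamic pricing algorithm based on Thompson sampling that has theoretical performance guarantees and performs well in numerical simulations. The algorithm is also extended to the more general multi-armed bandit problem with resource constraints, and to other revenue management settings."

The core question of the paper is how to set prices well in network revenue management when information is incomplete. The classical multi-armed bandit problem concerns choosing among several uncertain reward sources so as to maximize cumulative reward. Thompson sampling is an effective exploration-exploitation strategy that is widely used in stochastic optimization and Bayesian decision theory. In this paper, the authors apply it to network revenue management, adjusting prices dynamically in order to estimate and exploit the unknown parameters of the demand function.

In practice, the sensitivity of demand to price can vary by product, market, and time, so the retailer has to experiment to find the best prices. Thompson sampling lets the retailer balance exploring new prices to learn demand against exploiting existing knowledge to raise revenue. At each decision point the algorithm draws a sample from the current posterior distribution, uses it to predict the expected revenue of each price, and then chooses the price with the highest expected revenue under that sample.

The paper establishes the algorithm's performance through theoretical analysis and verifies it in numerical simulations across several settings. The authors also show how to generalize the algorithm to multi-armed bandit problems with multiple resource constraints, which are common in practice: a retailer may, for example, face limits on inventory, production capacity, or other resources. Overall, the paper offers a Thompson sampling based approach to network revenue management and emphasizes that balancing exploration and exploitation well matters for revenue. The approach is of theoretical interest and is also a practical tool for business decisions.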

Ferreira, Simchi-Levi, and Wang: Online Network Revenue Management using Thompson Sampling

Article submitted to Operations Research; manuscript no. (Please, provide the manuscript number!) 9

When the retailer observes a realized demand instance under the offered price $p \in \{p_1, p_2, \ldots, p_K\}$, it obtains some information about the two unknown components of the parameter $\theta$, which enables the retailer to learn demand not only for the offered price, but also for prices that are not offered.
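To make the idea of learning about unoffered prices concrete, here is a minimal sketch under assumptions that are not taken from the paper: a single product whose purchase probability at price p is `a - b * p`, a small hypothetical grid of candidate `(a, b)` values with a uniform prior, and Bernoulli demand. Observations are made at price 4 only, yet the posterior mean demand at price 8 also changes because both prices share the same parameters.

```python
import itertools

# Hypothetical parametric model: purchase probability at price p is a - b * p.
GRID_A = [0.6, 0.8, 1.0]
GRID_B = [0.02, 0.05, 0.08]

def purchase_prob(a, b, price):
    """Purchase probability under the assumed linear model, clipped to [0, 1]."""
    return min(1.0, max(0.0, a - b * price))

def update(posterior, price, sold):
    """Bayes update of the grid posterior after one Bernoulli demand observation."""
    new = {}
    for (a, b), weight in posterior.items():
        q = purchase_prob(a, b, price)
        new[(a, b)] = weight * (q if sold else 1.0 - q)
    total = sum(new.values())
    return {ab: w / total for ab, w in new.items()}

def mean_demand(posterior, price):
    """Posterior mean purchase probability at a given price."""
    return sum(w * purchase_prob(a, b, price) for (a, b), w in posterior.items())

# Uniform prior over the hypothetical parameter grid.
posterior = {ab: 1.0 / 9.0 for ab in itertools.product(GRID_A, GRID_B)}

before = mean_demand(posterior, 8.0)
for _ in range(5):
    # Five sales are observed, all at price 4; price 8 is never offered.
    posterior = update(posterior, 4.0, sold=True)
after = mean_demand(posterior, 8.0)

print(f"posterior mean demand at price 8 before: {before:.3f}")
print(f"posterior mean demand at price 8 after:  {after:.3f}")
```

In this sketch the estimate at price 8 rises after sales at price 4, since those sales favor parameter values with a high intercept and a low slope.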

Relationship to the Multi-Armed Bandit Problem. The model formulated above is a generalization of the multi-armed bandit (MAB) problem that has been extensively studied in the statistics and operations research literature (where each price is an “arm” and revenue is the “reward”), except for two main deviations. First, our formulation allows for the network revenue management setting (Gallego and Van Ryzin (1997)) where multiple products consuming common resources are sold. Second, there are inventory constraints present in our setting, whereas there are no such constraints in the MAB model.

We note that the presence of inventory constraints significantly complicates the problem, even for the special case of a single product. In the MAB setting, if mean revenue associated with each price vector is known, the optimal strategy is to choose a price vector with the highest mean revenue. But in the presence of limited inventory, a mixed strategy that chooses multiple price vectors over the selling season may achieve significantly higher revenue than any single price strategy. Therefore, a good pricing strategy should converge not to a single price, but to a distribution of (possibly) multiple prices. Another challenging task in the analysis is to estimate the time when the inventory of each resource runs out, which is itself a random variable depending on the pricing policy used by the retailer. Such estimation is necessary for computing the retailer’s expected revenue. This is in contrast to classical MAB problems for which the process always ends at a fixed period.
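A small worked example may help show why a mix of prices can beat any single price once inventory is limited. All numbers below are hypothetical and not from the paper: one product, an average inventory budget of 0.5 units per period, and two prices with known mean demand. The calculation uses expected values only, so it is a sketch of the deterministic approximation rather than a simulation.

```python
# Hypothetical single-product example; every number is an illustrative assumption.
BUDGET = 0.5                           # average inventory available per period (I / T)
MEAN_DEMAND = {10.0: 0.3, 6.0: 0.8}    # price -> assumed known mean demand per period

def single_price_revenue(price, demand, budget):
    """Best expected per-period revenue when only one price may be used.

    The price is offered with probability x (no sale is made otherwise), and x
    is capped so that expected consumption x * demand stays within the budget.
    """
    x = min(1.0, budget / demand)
    return x * price * demand

best_single = max(
    single_price_revenue(p, d, BUDGET) for p, d in MEAN_DEMAND.items()
)

# Mix the two prices so that the budget is used exactly:
#   x_hi + x_lo = 1   and   0.3 * x_hi + 0.8 * x_lo = BUDGET
d_hi, d_lo = MEAN_DEMAND[10.0], MEAN_DEMAND[6.0]
x_lo = (BUDGET - d_hi) / (d_lo - d_hi)
x_hi = 1.0 - x_lo
mixed = x_hi * 10.0 * d_hi + x_lo * 6.0 * d_lo

print(f"best single price: {best_single:.2f}")
print(f"mixed strategy:    {mixed:.2f}")
```

With these assumed numbers, either price alone earns 3.00 per period in expectation, while offering the high price 60% of the time and the low price 40% of the time earns 3.72 without exceeding the budget.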

Our model is also closely related to the models studied in Badanidiyuru et al. (2013) and Besbes and Zeevi (2012). Badanidiyuru et al. (2013) considers a multi-armed bandit problem with global resource constraints. We will discuss this problem and extend our algorithms to this setting in Section 4.3. Besbes and Zeevi (2012) studies a similar network revenue management model with continuous time and unknown demand, considering both discrete and continuous price sets. Our model can incorporate their setting by discretizing time, and we will discuss the extension to continuous price sets in Section 4.1.

2.2. Thompson Sampling with Fixed Inventory Constraints

In this section, we propose our first Thompson sampling based algorithm for the discrete price model described in Section 2.1. For each resource $j \in [M]$, we define a fixed constant $c_j := I_j/T$. Given any demand parameter $\rho \in \Theta$, we define the mean demand under $\rho$ as the expectation associated with CDF $F(x_1, \ldots, x_N; p_k, \rho)$ for each product $i \in [N]$ and price vector $k \in [K]$. We denote by $d = \{d_{ik}\}_{i \in [N], k \in [K]}$ the mean demand under the true model parameter $\theta$.

We present our Thompson Sampling with Fixed Inventory Constraints algorithm (TS-fixed for short) in Algorithm 1. Here, “TS” stands for Thompson sampling, while “fixed” refers to the fact that we use fixed constants $c_j$ for all time periods as opposed to updating $c_j$ over the selling season as inventory is depleted; this latter idea is incorporated into the algorithm we present in Section 2.3.

Algorithm 1 Thompson Sampling with Fixed Inventory Constraints (TS-fixed)

Repeat the following steps for all periods $t = 1, \ldots, T$:

1. Sample Demand: Sample a random parameter $\theta(t) \in \Theta$ according to the posterior distribution of $\theta$ given history $H_{t-1}$. Let the mean demand under $\theta(t)$ be $d(t) = \{d_{ik}(t)\}_{i \in [N], k \in [K]}$.

2. Optimize Prices given Sampled Demand: Solve the following linear program, denoted by $LP(d(t))$:

$$
\begin{aligned}
LP(d(t)): \quad \max_{x} \quad & \sum_{k=1}^{K} \Big( \sum_{i=1}^{N} p_{ik}\, d_{ik}(t) \Big) x_k \\
\text{subject to} \quad & \sum_{k=1}^{K} \Big( \sum_{i=1}^{N} a_{ij}\, d_{ik}(t) \Big) x_k \le c_j, \quad \forall j \in [M] \\
& \sum_{k=1}^{K} x_k \le 1 \\
& x_k \ge 0, \quad k \in [K].
\end{aligned}
$$

Let $x(t) = (x_1(t), \ldots, x_K(t))$ be the optimal solution to $LP(d(t))$.

3. Offer Price: Offer price vector $P(t) = p_k$ with probability $x_k(t)$, and choose $P(t) = p_\infty$ with probability $1 - \sum_{k=1}^{K} x_k(t)$.

4. Update Estimate of Parameter: Observe demand $D(t)$. Update the history $H_t = H_{t-1} \cup \{P(t), D(t)\}$ and the posterior distribution of $\theta$ given $H_t$.

Steps 1 and 4 are based on the Thompson sampling algorithm for the classical multi-armed bandit setting, whereas steps 2 and 3 are added to incorporate inventory constraints. In step 1 of the algorithm, we randomly sample parameter $\theta(t)$ according to the posterior distribution of unknown demand parameter $\theta$. This step is motivated by the original Thompson sampling algorithm for the classical multi-armed bandit problem. A novel idea of the Thompson sampling algorithm is to use random sampling from the posterior distribution to balance the exploration-exploitation tradeoff. To be more precise, let us consider an example when there is unlimited inventory. Without loss of generality, let us assume that price vector $p_1$ has the highest expected revenue under the posterior distribution in the current period. If the retailer acts greedily (i.e., focusing only on the exploitation objective), it would maximize the expected revenue in this period by choosing $p_1$ with probability one. However, there is no guarantee that $p_1$ is indeed the optimal price under the true demand. In Thompson sampling, the retailer balances the exploration-exploitation tradeoff by using demand values that are randomly sampled, which means that there is a positive probability that the retailer will choose a price vector other than $p_1$, thus achieving the exploration objective. Guaranteeing positive probability to pursue each objective (exploration and exploitation) is essential to discover the true demand parameter over time (cf. Harrison et al. 2012).
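The contrast between acting greedily and sampling from the posterior can be sketched in a few lines. The setup is an assumption made for illustration only: unlimited inventory, a single product, three hypothetical prices, Bernoulli demand, and independent Beta posteriors on the purchase probability at each price.

```python
import random

# Hypothetical prices and Beta(alpha, beta) posteriors over purchase probability.
PRICES = [5.0, 6.0, 8.0]
POSTERIORS = [(7, 3), (5, 5), (2, 4)]

def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)

def greedy_choice():
    """Pure exploitation: pick the price with the highest posterior mean revenue."""
    revenue = [p * a / (a + b) for p, (a, b) in zip(PRICES, POSTERIORS)]
    return argmax(revenue)

def thompson_choice(rng):
    """Thompson sampling: draw one demand sample per price from its posterior,
    then pick the price with the highest revenue under that sample."""
    revenue = [p * rng.betavariate(a, b) for p, (a, b) in zip(PRICES, POSTERIORS)]
    return argmax(revenue)

rng = random.Random(0)
greedy_pick = greedy_choice()

counts = [0] * len(PRICES)
for _ in range(10_000):
    counts[thompson_choice(rng)] += 1

print("greedy always picks price index:", greedy_pick)
print("Thompson sampling pick counts:  ", counts)
```

The greedy rule returns the same price every time for a fixed posterior, whereas the sampled rule spreads its choices over all prices, with the split depending on how much the posteriors overlap.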

The algorithm differs from ordinary Thompson sampling in steps 2 and 3. In step 2, the retailer solves a linear program, $LP(d(t))$, which identifies the optimal mixed price strategy that maximizes expected revenue given the sampled parameters. The first constraint specifies that the average resource consumption in this time period cannot exceed $c_j$, the average inventory available per period. The second constraint specifies that the sum of probabilities of choosing a price vector cannot exceed one. In step 3, the retailer randomly offers one of the $K$ price vectors (or $p_\infty$) according to probabilities specified by the optimal solution of $LP(d(t))$. Finally, in step 4, the algorithm updates the posterior distribution of $\theta$ given $H_t$. Such Bayesian updating is a simple and powerful tool to update belief probabilities as more information (customer purchase decisions in our case) becomes available. By employing Bayesian updating in step 4, we are ensured that as any price vector $p_k$ is offered more and more times, the sampled mean demand associated with $p_k$ for each product $i$ becomes more and more centered around the true mean demand, $d_{ik}$ (cf. Freedman 1963).
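The four steps can be sketched end to end for a deliberately small special case. Everything specific here is an assumption rather than the paper's implementation: one product, one resource with one unit consumed per sale, Bernoulli demand, independent Beta(1, 1) priors per price, selling simply stops when inventory hits zero, and the helper names `solve_lp` and `ts_fixed` are hypothetical. Because this LP has only two constraints besides nonnegativity, some optimal solution uses at most two prices, so the sketch enumerates those candidates instead of calling an LP solver.

```python
import random

def solve_lp(prices, demand, c):
    """Maximize sum_k p_k d_k x_k  s.t.  sum_k d_k x_k <= c, sum_k x_k <= 1, x >= 0.

    Single product, single resource. Candidate vertices have one nonzero x_k
    (one constraint tight) or two nonzero entries (both constraints tight).
    """
    K = len(prices)
    best_val, best_x = 0.0, [0.0] * K

    def consider(x):
        nonlocal best_val, best_x
        val = sum(p * d * xk for p, d, xk in zip(prices, demand, x))
        if val > best_val:
            best_val, best_x = val, x

    for k in range(K):
        x = [0.0] * K
        x[k] = 1.0 if demand[k] <= c else c / demand[k]
        consider(x)
        for l in range(k + 1, K):
            if demand[k] == demand[l]:
                continue
            xk = (c - demand[l]) / (demand[k] - demand[l])
            if 0.0 <= xk <= 1.0:
                y = [0.0] * K
                y[k], y[l] = xk, 1.0 - xk
                consider(y)
    return best_x

def ts_fixed(prices, true_prob, inventory, T, seed=0):
    """Simulate the simplified TS-fixed loop; returns (revenue, units sold)."""
    rng = random.Random(seed)
    K = len(prices)
    c = inventory / T                       # fixed per-period budget c = I / T
    alpha, beta = [1] * K, [1] * K          # Beta(1, 1) priors (assumption)
    revenue, remaining = 0.0, inventory
    for _ in range(T):
        if remaining == 0:
            break                           # assumption: stop once stock is gone
        # Step 1: sample mean demand for each price from its posterior.
        d = [rng.betavariate(alpha[k], beta[k]) for k in range(K)]
        # Step 2: optimal price mix for the sampled demand.
        x = solve_lp(prices, d, c)
        # Step 3: offer price k with probability x[k]; otherwise shut off.
        u, acc, offered = rng.random(), 0.0, None
        for k in range(K):
            acc += x[k]
            if u < acc:
                offered = k
                break
        if offered is None:
            continue                        # shut-off price: no sale, no update
        # Step 4: observe demand and update the posterior for the offered price.
        if rng.random() < true_prob[offered]:
            alpha[offered] += 1
            revenue += prices[offered]
            remaining -= 1
        else:
            beta[offered] += 1
    return revenue, inventory - remaining

revenue, sold = ts_fixed([10.0, 6.0], [0.3, 0.8], inventory=500, T=1000, seed=1)
print("revenue:", revenue, "units sold:", sold)
```

For the general case with several products and resources, step 2 would be handed to a proper LP solver; the enumeration above is only valid for this one-resource special case.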

We note that the LP defined in step 2 is closely related to the LP used by Gallego and Van Ryzin (1997), where they consider a network revenue management problem in the case of known demand. Their pricing algorithm is essentially a special case of Algorithm 1 where they solve $LP(d)$, i.e., $LP(d(t))$ with $d(t) = d$, in every time period. Moreover, they show that the optimal value of $LP(d)$ is an upper bound on the expected optimal revenue that can be achieved in such a network revenue management setting; in Section 3.1.1 we present this upper bound and discuss the similarities between the two linear programs.

Next we illustrate the application of our TS-fixed algorithm by providing two concrete examples. For simplicity, in both examples we assume that the prior distributions of demand for different prices are independent; however, the definition of TS-fixed and the theoretical results in Section 3.1 are quite general and allow the prior distribution to be arbitrarily correlated for different prices. As mentioned earlier, this enables the retailer to learn the mean demand not only for the offered price, but also for prices that are not offered.
