Assignment 2
ECS7002P - Artificial Intelligence in Games
November 9, 2020
In this assignment, you will implement a variety of reinforcement learning algorithms to find policies for the
frozen lake environment. Please read this entire document before you start working on the assignment.
1 Environment
The frozen lake environment has two main variants: the small frozen lake (Fig. 1) and the big frozen lake (Fig.
2). In both cases, each tile in a square grid corresponds to a state. There is also an additional absorbing state,
which will be introduced soon. There are four types of tiles: start (grey), frozen lake (light blue), hole (dark
blue), and goal (white). The agent has four actions, which correspond to moving one tile up, left, down, or
right. However, with probability 0.1, the environment ignores the desired direction and the agent slips (moves
one tile in a random direction). An action that would cause the agent to move outside the grid leaves the state
unchanged.
Figure 1: Small frozen lake
Figure 2: Big frozen lake
The agent receives reward 1 upon taking an action at the goal. In every other case, the agent receives zero
reward. Note that the agent does not receive a reward upon moving into the goal (nor a negative reward upon
moving into a hole). Upon taking an action at the goal or in a hole, the agent moves into the absorbing state.
Every action taken at the absorbing state leads to the absorbing state, which also does not provide rewards.
Assume a discount factor of γ = 0.9.
For the purposes of model-free reinforcement learning (or interactive testing), the agent is able to interact
with the frozen lake for a number of time steps that is equal to the number of tiles.
Your first task is to implement the frozen lake environment. Using either Python or Java, try to mimic the
Python interface presented in Listing 1.
Listing 1: Frozen lake environment.
The class EnvironmentModel represents a model of an environment. The constructor of this class receives
a number of states, a number of actions, and a seed that controls the pseudorandom number generator. Its
subclasses must implement two methods: p and r. The method p returns the probability of transitioning from
state to next state given action. The method r returns the expected reward for transitioning from state to
next state given action. The method draw receives a pair of state and action and returns a next state drawn
according to p together with the corresponding expected reward. Note that states and actions are represented
by integers starting at zero. We highly recommend that you follow the same convention, since it greatly simplifies
the implementation of the reinforcement learning algorithms. You can use a Python dictionary (or an equivalent
data structure) to map between these integers and a more convenient representation when necessary. Note that, in
general, agents may receive rewards drawn probabilistically by an environment, which is not supported in this
simplified implementation.
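
As a rough illustration of this interface, a minimal sketch of an EnvironmentModel class is given below; the attribute names (n_states, n_actions, random_state) are assumptions for illustration and need not match Listing 1.

import numpy as np


class EnvironmentModel:
    def __init__(self, n_states, n_actions, seed=None):
        self.n_states = n_states
        self.n_actions = n_actions
        # The seed controls the pseudorandom number generator used by draw.
        self.random_state = np.random.RandomState(seed)

    def p(self, next_state, state, action):
        # Probability of transitioning from state to next_state given action.
        raise NotImplementedError()

    def r(self, next_state, state, action):
        # Expected reward for transitioning from state to next_state given action.
        raise NotImplementedError()

    def draw(self, state, action):
        # Draw a next state according to p and pair it with the expected reward.
        p = [self.p(ns, state, action) for ns in range(self.n_states)]
        next_state = self.random_state.choice(self.n_states, p=p)
        reward = self.r(next_state, state, action)
        return next_state, reward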
The class Environment represents an interactive environment and inherits from EnvironmentModel. The
constructor of this class receives a number of states, a number of actions, a maximum number of steps for
interaction, a probability distribution over initial states, and a seed that controls the pseudorandom number
generator. Its subclasses must implement two methods: p and r, which were already explained above. This
class has two new methods: reset and step. The method reset restarts the interaction between the agent and
the environment by setting the number of time steps to zero and drawing a state according to the probability
distribution over initial states. This state is stored by the class. The method step receives an action and returns
a next state drawn according to p, the corresponding expected reward, and a flag variable. The new state is
stored by the class. This method also keeps track of how many steps have been taken. Once the number of steps
matches or exceeds the pre-defined maximum number of steps, the flag variable indicates that the interaction
should end.
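
A matching sketch of the Environment class, building on the previous sketch and again using assumed attribute names (max_steps, dist, n_steps, state):

class Environment(EnvironmentModel):
    def __init__(self, n_states, n_actions, max_steps, dist, seed=None):
        super().__init__(n_states, n_actions, seed)
        self.max_steps = max_steps
        # Probability distribution over initial states (uniform if none is given).
        self.dist = dist if dist is not None else np.full(n_states, 1.0 / n_states)

    def reset(self):
        # Restart the interaction: zero the step counter and draw an initial state.
        self.n_steps = 0
        self.state = self.random_state.choice(self.n_states, p=self.dist)
        return self.state

    def step(self, action):
        if action < 0 or action >= self.n_actions:
            raise Exception('Invalid action.')
        # Count the step; the flag signals that the interaction should end.
        self.n_steps += 1
        done = (self.n_steps >= self.max_steps)
        self.state, reward = self.draw(self.state, action)
        return self.state, reward, done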
The class FrozenLake represents the frozen lake environment. Your task is to implement the methods p and
r for this class. The constructor of this class receives a matrix that represents a lake, a probability that the
agent will slip at any given time step, a maximum number of steps for interaction, and a seed that controls
the pseudorandom number generator. This class overrides the method step to indicate that the interaction
should also end when the absorbing state is reached. The method render is capable of rendering the state of
the environment or a pair of policy and value function.
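
A possible skeleton for the FrozenLake class is sketched below. The tile symbols ('&' start, '.' frozen, '#' hole, '$' goal) and the indexing of the absorbing state are assumptions, the render method is omitted, and the bodies of p and r are left as the task requires.

class FrozenLake(Environment):
    def __init__(self, lake, slip, max_steps, seed=None):
        # lake: grid of characters, e.g. '&' start, '.' frozen, '#' hole, '$' goal.
        self.lake = np.array(lake)
        self.lake_flat = self.lake.reshape(-1)
        self.slip = slip

        n_states = self.lake.size + 1      # one extra absorbing state
        n_actions = 4                      # up, left, down, right
        self.absorbing_state = n_states - 1

        # The agent always starts at the start tile.
        dist = np.zeros(n_states)
        dist[np.where(self.lake_flat == '&')[0]] = 1.0

        super().__init__(n_states, n_actions, max_steps, dist, seed)

    def step(self, action):
        state, reward, done = super().step(action)
        # The interaction also ends once the absorbing state is reached.
        done = (state == self.absorbing_state) or done
        return state, reward, done

    def p(self, next_state, state, action):
        # To be implemented: combine the intended move with the slip probability.
        raise NotImplementedError()

    def r(self, next_state, state, action):
        # To be implemented: reward 1 only for taking an action at the goal tile.
        raise NotImplementedError()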
The function play can be used to test your implementation of the environment before you try the next tasks.
2 Tabular model-based reinforcement learning
Your next task is to implement policy evaluation, policy improvement, policy iteration, and value iteration.
You may follow the interface suggested in Listing 2.
Listing 2: Tabular model-based algorithms.
The function policy_evaluation receives an environment model, a deterministic policy, a discount factor, a
tolerance parameter, and a maximum number of iterations. A deterministic policy may be represented by an
array that contains the action prescribed for each state.
The function policy_improvement receives an environment model, the value function for a policy to be
improved, and a discount factor.
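
A minimal sketch of policy_evaluation and policy_improvement, assuming the environment model exposes n_states, n_actions, p, and r as in the earlier sketches:

import numpy as np


def policy_evaluation(env, policy, gamma, theta, max_iterations):
    # Iteratively approximate the value of a deterministic policy.
    value = np.zeros(env.n_states)
    for _ in range(max_iterations):
        delta = 0.0
        for s in range(env.n_states):
            v = value[s]
            a = policy[s]
            value[s] = sum(env.p(ns, s, a) * (env.r(ns, s, a) + gamma * value[ns])
                           for ns in range(env.n_states))
            delta = max(delta, abs(v - value[s]))
        if delta < theta:
            break
    return value


def policy_improvement(env, value, gamma):
    # Act greedily with respect to the given value function.
    policy = np.zeros(env.n_states, dtype=int)
    for s in range(env.n_states):
        q = [sum(env.p(ns, s, a) * (env.r(ns, s, a) + gamma * value[ns])
                 for ns in range(env.n_states))
             for a in range(env.n_actions)]
        policy[s] = int(np.argmax(q))
    return policy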
The function policy_iteration receives an environment model, a discount factor, a tolerance parameter, a
maximum number of iterations, and (optionally) the initial policy.
The function value_iteration receives an environment model, a discount factor, a tolerance parameter, a
maximum number of iterations, and (optionally) the initial value function.
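
A sketch of policy_iteration and value_iteration under the same assumptions, reusing the two functions above:

def policy_iteration(env, gamma, theta, max_iterations, policy=None):
    # Alternate evaluation and improvement until the policy stops changing.
    if policy is None:
        policy = np.zeros(env.n_states, dtype=int)
    value = np.zeros(env.n_states)
    for _ in range(max_iterations):
        value = policy_evaluation(env, policy, gamma, theta, max_iterations)
        improved = policy_improvement(env, value, gamma)
        if np.array_equal(improved, policy):
            break
        policy = improved
    return policy, value


def value_iteration(env, gamma, theta, max_iterations, value=None):
    # Back up the optimal value function directly, then extract a greedy policy.
    if value is None:
        value = np.zeros(env.n_states)
    for _ in range(max_iterations):
        delta = 0.0
        for s in range(env.n_states):
            v = value[s]
            value[s] = max(sum(env.p(ns, s, a) * (env.r(ns, s, a) + gamma * value[ns])
                               for ns in range(env.n_states))
                           for a in range(env.n_actions))
            delta = max(delta, abs(v - value[s]))
        if delta < theta:
            break
    policy = policy_improvement(env, value, gamma)
    return policy, value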
3 Tabular model-free reinforcement learning
Your next task is to implement Sarsa control and Q-learning control. You may follow the interface suggested in
Listing 3. We recommend that you use the small frozen lake to test your implementation, since these algorithms
may need many episodes to find an optimal policy for the big frozen lake.
Listing 3: Tabular model-free algorithms.
The function sarsa receives an environment, a maximum number of episodes, an initial learning rate, a
discount factor, an initial exploration factor, and an (optional) seed that controls the pseudorandom number
generator. Note that the learning rate and exploration factor decrease linearly as the number of episodes
increases (for instance, eta[i] contains the learning rate for episode i).
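
A minimal sketch of sarsa under these assumptions, with linearly decaying learning rate and exploration factor and an epsilon-greedy policy that breaks ties randomly:

import numpy as np


def sarsa(env, max_episodes, eta, gamma, epsilon, seed=None):
    random_state = np.random.RandomState(seed)

    # Learning rate and exploration factor decay linearly over episodes.
    eta = np.linspace(eta, 0, max_episodes)
    epsilon = np.linspace(epsilon, 0, max_episodes)

    q = np.zeros((env.n_states, env.n_actions))

    def e_greedy(s, i):
        # Break ties randomly between actions that maximize q[s].
        if random_state.rand() < epsilon[i]:
            return random_state.choice(env.n_actions)
        return random_state.choice(np.flatnonzero(q[s] == q[s].max()))

    for i in range(max_episodes):
        s = env.reset()
        a = e_greedy(s, i)
        done = False
        while not done:
            next_s, r, done = env.step(a)
            next_a = e_greedy(next_s, i)
            # On-policy update towards the value of the action actually taken next.
            q[s, a] += eta[i] * (r + gamma * q[next_s, next_a] - q[s, a])
            s, a = next_s, next_a

    policy = q.argmax(axis=1)
    value = q.max(axis=1)
    return policy, value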
The function q_learning receives an environment, a maximum number of episodes, an initial learning rate,
a discount factor, an initial exploration factor, and an (optional) seed that controls the pseudorandom number
generator. Note that the learning rate and exploration factor decrease linearly as the number of episodes
increases (for instance, eta[i] contains the learning rate for episode i).
Important: The ε-greedy policy based on Q should break ties randomly between actions that maximize Q
for a given state. This plays a large role in encouraging exploration.
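
A matching sketch of q_learning; it differs from Sarsa only in that the update bootstraps from the greedy action value of the next state, and the random tie-breaking described above is shown explicitly:

import numpy as np


def q_learning(env, max_episodes, eta, gamma, epsilon, seed=None):
    random_state = np.random.RandomState(seed)

    eta = np.linspace(eta, 0, max_episodes)
    epsilon = np.linspace(epsilon, 0, max_episodes)

    q = np.zeros((env.n_states, env.n_actions))

    for i in range(max_episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action with random tie-breaking among maximizing actions.
            if random_state.rand() < epsilon[i]:
                a = random_state.choice(env.n_actions)
            else:
                a = random_state.choice(np.flatnonzero(q[s] == q[s].max()))
            next_s, r, done = env.step(a)
            # Off-policy update towards the greedy value of the next state.
            q[s, a] += eta[i] * (r + gamma * q[next_s].max() - q[s, a])
            s = next_s

    policy = q.argmax(axis=1)
    value = q.max(axis=1)
    return policy, value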
4 Non-tabular model-free reinforcement learning
In this task, you will treat the frozen lake environment as if it required linear action-value function approximation.
Your task is to implement Sarsa control and Q-learning control using linear function approximation.
In the process, you will learn that tabular model-free reinforcement learning is a special case of non-tabular
model-free reinforcement learning. You may follow the interface suggested in Listing 4.
Listing 4: Non-tabular model-free algorithms.
The class LinearWrapper implements a wrapper that behaves similarly to an environment that is given to
its constructor. However, the methods reset and step return a feature matrix when they would typically return
a state s. Each row a of this feature matrix contains the feature vector φ(s, a) that represents the pair of state
and action (s, a). The method encode_state is responsible for representing a state by such a feature matrix.
More concretely, each possible pair of state and action is represented by a different vector where all elements
except one are zero. Therefore, the feature matrix has |S||A| columns. The method decode_policy receives a
parameter vector θ obtained by a non-tabular reinforcement learning algorithm and returns the corresponding
greedy policy together with its value function estimate.
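
A minimal sketch of LinearWrapper, assuming one-hot features indexed as s * n_actions + a (any fixed indexing of state-action pairs works):

import numpy as np


class LinearWrapper:
    def __init__(self, env):
        self.env = env
        self.n_actions = env.n_actions
        self.n_states = env.n_states
        self.n_features = self.n_states * self.n_actions

    def encode_state(self, s):
        # Row a is the one-hot feature vector phi(s, a) for the pair (s, a).
        features = np.zeros((self.n_actions, self.n_features))
        for a in range(self.n_actions):
            features[a, s * self.n_actions + a] = 1.0
        return features

    def decode_policy(self, theta):
        # Recover the greedy policy and its value estimate from the parameters.
        policy = np.zeros(self.n_states, dtype=int)
        value = np.zeros(self.n_states)
        for s in range(self.n_states):
            q = self.encode_state(s).dot(theta)
            policy[s] = int(np.argmax(q))
            value[s] = float(np.max(q))
        return policy, value

    def reset(self):
        return self.encode_state(self.env.reset())

    def step(self, action):
        state, reward, done = self.env.step(action)
        return self.encode_state(state), reward, done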
The function linear_sarsa receives an environment (wrapped by LinearWrapper), a maximum number of
episodes, an initial learning rate, a discount factor, an initial exploration factor, and an (optional) seed that
controls the pseudorandom number generator. Note that the learning rate and exploration factor decay linearly
as the number of episodes grows (for instance, eta[i] contains the learning rate for episode i).
The function linear_q_learning receives an environment (wrapped by LinearWrapper), a maximum number
of episodes, an initial learning rate, a discount factor, an initial exploration factor, and an (optional) seed that
controls the pseudorandom number generator. Note that the learning rate and exploration factor decay linearly
as the number of episodes grows (for instance, eta[i] contains the learning rate for episode i).
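
A sketch of linear_sarsa under the same assumptions; with one-hot features it reduces to the tabular algorithm:

import numpy as np


def linear_sarsa(env, max_episodes, eta, gamma, epsilon, seed=None):
    random_state = np.random.RandomState(seed)

    eta = np.linspace(eta, 0, max_episodes)
    epsilon = np.linspace(epsilon, 0, max_episodes)

    theta = np.zeros(env.n_features)

    def e_greedy(q, i):
        # Break ties randomly between actions that maximize q.
        if random_state.rand() < epsilon[i]:
            return random_state.choice(env.n_actions)
        return random_state.choice(np.flatnonzero(q == q.max()))

    for i in range(max_episodes):
        features = env.reset()          # feature matrix, one row per action
        q = features.dot(theta)
        a = e_greedy(q, i)
        done = False
        while not done:
            next_features, r, done = env.step(a)
            next_q = next_features.dot(theta)
            next_a = e_greedy(next_q, i)
            # Semi-gradient Sarsa update on the linear parameters.
            delta = r + gamma * next_q[next_a] - q[a]
            theta += eta[i] * delta * features[a]
            features, q, a = next_features, next_q, next_a

    return theta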
The Q-learning control algorithm for linear function approximation is presented in Algorithm 1. Note that
this algorithm uses a slightly different convention for naming variables and omits some details for the sake of
simplicity (such as learning rate/exploration factor decay).
Algorithm 1: Q-learning control algorithm for linear function approximation
Input: feature vector φ(s, a) for all state-action pairs (s, a), number of episodes N, learning rate α, probability ε of choosing a random action.
Important: The ε-greedy policy based on Q should break ties randomly between actions that maximize Q
(Algorithm 1, Line 9). This plays a large role in encouraging exploration.
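
A matching sketch of linear_q_learning in the spirit of Algorithm 1, with the learning rate and exploration factor decay included:

import numpy as np


def linear_q_learning(env, max_episodes, eta, gamma, epsilon, seed=None):
    random_state = np.random.RandomState(seed)

    eta = np.linspace(eta, 0, max_episodes)
    epsilon = np.linspace(epsilon, 0, max_episodes)

    theta = np.zeros(env.n_features)

    for i in range(max_episodes):
        features = env.reset()
        q = features.dot(theta)
        done = False
        while not done:
            # Epsilon-greedy action with random tie-breaking (cf. Algorithm 1, Line 9).
            if random_state.rand() < epsilon[i]:
                a = random_state.choice(env.n_actions)
            else:
                a = random_state.choice(np.flatnonzero(q == q.max()))
            next_features, r, done = env.step(a)
            next_q = next_features.dot(theta)
            # Semi-gradient Q-learning update towards the greedy next value.
            delta = r + gamma * next_q.max() - q[a]
            theta += eta[i] * delta * features[a]
            features, q = next_features, next_q

    return theta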
5 Main function
Your final implementation task is to write a program that uses all the algorithms that you have implemented for
this assignment. Your main function should behave analogously to the function presented in Listing 5. Using
the small frozen lake as a benchmark, find and render an optimal policy using policy iteration, value iteration,
Sarsa control, Q-learning control, linear Sarsa control, and linear Q-learning. For grading purposes, if your
main function does not call one of these algorithms, we will assume that it is not implemented correctly.
Listing 5: Main function.
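
As a rough guide, a main function in the spirit of Listing 5 might look as follows; the small lake layout, tile symbols, hyperparameter values, and render call are assumptions for illustration, and the functions called are those sketched in the previous sections.

def main():
    seed = 0

    # Small frozen lake layout (assumed): '&' start, '.' frozen, '#' hole, '$' goal.
    lake = [['&', '.', '.', '.'],
            ['.', '#', '.', '#'],
            ['.', '.', '.', '#'],
            ['#', '.', '.', '$']]

    # Maximum number of steps equal to the number of tiles, as described in Section 1.
    env = FrozenLake(lake, slip=0.1, max_steps=16, seed=seed)
    gamma = 0.9

    print('## Policy iteration')
    policy, value = policy_iteration(env, gamma, theta=0.001, max_iterations=128)
    env.render(policy, value)   # render signature assumed

    print('## Value iteration')
    policy, value = value_iteration(env, gamma, theta=0.001, max_iterations=128)
    env.render(policy, value)

    print('## Sarsa')
    policy, value = sarsa(env, max_episodes=4000, eta=0.5, gamma=gamma,
                          epsilon=0.5, seed=seed)
    env.render(policy, value)

    print('## Q-learning')
    policy, value = q_learning(env, max_episodes=4000, eta=0.5, gamma=gamma,
                               epsilon=0.5, seed=seed)
    env.render(policy, value)

    linear_env = LinearWrapper(env)

    print('## Linear Sarsa')
    theta = linear_sarsa(linear_env, max_episodes=4000, eta=0.5, gamma=gamma,
                         epsilon=0.5, seed=seed)
    policy, value = linear_env.decode_policy(theta)
    env.render(policy, value)

    print('## Linear Q-learning')
    theta = linear_q_learning(linear_env, max_episodes=4000, eta=0.5, gamma=gamma,
                              epsilon=0.5, seed=seed)
    policy, value = linear_env.decode_policy(theta)
    env.render(policy, value)


if __name__ == '__main__':
    main()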
6 Submission instructions
This assignment corresponds to 40% of the final grade for this module. You will work in groups of 3 students. The
deadline for submitting this assignment is December 11th, 2020. Penalties for late submissions will be applied
in accordance with the School policy. The submission cut-off date is 7 days after the deadline. Submissions
should be made through QM+. Submissions by e-mail will be ignored. Please always check whether the files
were uploaded correctly to QM+. Cases of extenuating circumstances have to go through the proper procedure
in accordance with the School policy. Only cases approved by the School in due time will be considered.
You will find the group selection page in QM+. You must be part of a group in QM+ before submitting
your assignment. Plagiarism leads to irreversible, non-negotiable failure in the module. If you are unsure about
what constitutes plagiarism, please ask.
This assignment requires a group submission and an individual submission, which are detailed in the next
sections.
6.1 Group submission
The group submission must be a single zip file. This file must contain a single folder named group[group id].
This folder must contain a report and a folder named code.
The code folder must contain a file named README.txt, which explains how to run your main function (see
Section 5). Based solely on the correctness and clarity of your code, you will receive the following number of
points for accomplishing each of the following tasks:
1. Implementing the frozen lake environment [10/100]
2. Implementing policy iteration [10/100]
3. Implementing value iteration [10/100]
4. Implementing Sarsa control [10/100]
5. Implementing Q-learning [10/100]
6. Implementing Sarsa control using linear function approximation [10/100]
7. Implementing Q-learning control using linear function approximation [10/100]
The report must be a single pdf file. Other formats are not acceptable. The report must be excellently
organized and identified with your names, student numbers, and module identifier. You will receive the following
number of points for answering each of the following questions:
1. Explain how your code for this assignment is organized. Did you make implementation decisions that
deviate significantly from what we suggested? [10/100]
2. How many iterations did policy iteration require to find an optimal policy for the big frozen lake? How
many iterations did value iteration require? Which algorithm was faster? [10/100]
3. How many episodes did Sarsa control require to find an optimal policy for the small frozen lake? How
many episodes did Q-learning control require? Hint: you may use policy evaluation to compare the value
of each policy obtained by these algorithms to the value of an optimal policy. [10/100]
4. In linear action-value function approximation, how can each element of the parameter vector θ be interpreted
when each possible pair of state s and action a is represented by a different feature vector φ(s, a)
where all elements except one are zero? Explain why the tabular model-free reinforcement learning algorithms
that you implemented are a special case of the non-tabular model-free reinforcement learning
algorithms that you implemented. [Bonus 5/100]
5. Try to find an optimal policy for the big frozen lake by tweaking the parameters for Sarsa control and
Q-learning control (maximum number of episodes, learning rate, and exploration factor). You must use
policy evaluation to confirm that the resulting policy is optimal. Even if you fail, describe your experience.
[Bonus 5/100]
6.2 Individual submission
Each student must submit a text that describes in no more than 300 words the role that each member of their
group had in the assignment. Your grade may be penalized if the individual submissions reveal that you have
not contributed enough to the assignment. The individual submission must be a single pdf file. Other formats
are not acceptable.