Assignment: Dog’s Life!

Let me introduce you to Odie the dog, the naive and earnest dog that belongs to Jon. Every day, while Jon is away at work, Odie spends the whole day trying to find and make something to gift Jon. At the end of the day, when Jon returns home, Odie presents what it did that day to Jon, expecting some reward. Since Jon is usually quite tired and in no mood for useless gifts, he usually just punishes Odie, unless Odie really brings him something cool.

In this assignment, we have simulated a reinforcement-learning-based dog whose life’s purpose, as it tends to be, is to please its owner. You will define the details of the states, actions, and rewards, and use the provided implementation of Q-Learning to learn a policy. Then, you will extend the environment with more states and actions, and see how well you can get the agent to perform.
1 Task Description
We will be modeling this scenario in the Malmo environment. Each run (or episode) consists of a single day of Odie putting something together for Jon, and ends when Odie is satisfied with what it has accomplished that day. The day starts with a number of items, which Odie can pick up in any order, but since he is a dog, he can hold a maximum of three items at any time and, more unfortunately, does not know how to drop any items. However, Odie is a little magical: he can combine some items to create new ones (some of which might be quite desirable to Jon)!
1.1 Provided Source Code
We have provided two Python files. The primary one is hw_dog.py, which contains the complete code to set up the Malmo environment and the default RL learning code. This file calls some of the methods in the second file, hw_dog_submission.py, but that file is incomplete. You will mostly be changing the second file.
1.2 Overview of the Code
The basic Minecraft environment consists of a number of predefined items (called items) strewn in a circle around the player agent. The player agent has an inventory, limited to a maximum size of three, which can be changed using inventory_limit. In terms of actions, based on the current inventory, the agent (Odie) can either pick up one of the existing items (implemented via a teleport command, which is why you won’t see the agent running around), combine multiple items in the inventory to craft a new item (using the recipes available as food_recipes), or decide to present everything in its inventory to Jon (present_gift). How Jon reacts to receiving a gift is stored in rewards_map, with the reward for multiple items simply being the sum of the rewards of each item.
We have provided an implementation of Q-Learning that follows an ε-greedy exploration policy while performing off-policy updates. The source code should be fairly self-explanatory, since it follows the existing Malmo tutorials quite closely (a related tutorial is tabular_q_learning.py, which is the solution to tutorial_6.py). Go through the implementation for details, and discuss on Campuswire if you have any doubts.
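For reference, the off-policy (Q-Learning) update performed after each step looks conceptually like the sketch below. This is a generic textbook illustration, not the exact code in hw_dog.py; the function name q_update, the default alpha and gamma values, and the dict-of-dicts layout of q_table are assumptions for the sketch only.

    # Generic tabular Q-Learning update (illustrative sketch, not the provided code).
    # q_table maps state -> {action: value}; alpha is the learning rate, gamma the discount.
    def q_update(q_table, prev_state, action, reward, curr_state, alpha=0.7, gamma=1.0):
        # Off-policy target: value of the best action in the new state, regardless of
        # which action the epsilon-greedy behavior policy actually takes next.
        best_next = max(q_table[curr_state].values()) if q_table[curr_state] else 0.0
        old_value = q_table[prev_state][action]
        q_table[prev_state][action] = old_value + alpha * (reward + gamma * best_next - old_value)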
1.3 Setup and Running the Code
Assuming you have installed Malmo, all you need to do to run this assignment is copy the two files above to the Python_Examples folder and, after launching Minecraft, run python hw_dog.py. If everything runs successfully, the agent should be doing random things, since the implementation is incomplete (more on this later). The output in the terminal should look like the following:
n= 1
1 Learning Q-Table: pumpkin, c_pumpkin_seeds, egg, present_gift, Reward: -75
2 Learning Q-Table: egg, egg, pumpkin, present_gift, Reward: -55
3 Learning Q-Table: pumpkin, sugar, egg, c_pumpkin_pie, present_gift, Reward: 100
4 Learning Q-Table: egg, pumpkin, sugar, c_pumpkin_pie, present_gift, Reward: 100
5 Showing best policy: egg, egg, sugar, present_gift, with reward -60.0
6 Learning Q-Table: pumpkin, egg, sugar, c_pumpkin_seeds, present_gift, Reward: -85
7 Learning Q-Table: pumpkin, egg, egg, c_pumpkin_seeds, present_gift, Reward: -100
8 Learning Q-Table: egg, sugar, present_gift, Reward: -35
9 Learning Q-Table: sugar, egg, present_gift, Reward: -35
10 Showing best policy: pumpkin, sugar, egg, c_pumpkin_seeds, present_gift, with reward -85.0
...
2 What Do I Submit?
Here we describe exactly what you need to submit for the assignment on Canvas.
1. Code: Define the State (10 points): As the first simple exercise, implement the get_curr_state function. Given the list of items in the inventory, this function returns an indexable Python object (hint: for example, a tuple) that represents the state of the inventory. Keep in mind that any possible combination of items will be a different state, and further, the order of the items does not matter (however, the quantity of each type of item does). The solution will be quite small (ours was 5 lines, and could have been smaller); a sketch of one possible approach appears after this list.
2. Number: Number of States (5 points): Assuming the environment contains one copy of each item
(including the crafted items), how many states do you think are possible? As when defining the state space
above, keep in mind that ordering between items in the inventory does not matter. Just provide a number.
3. Code: Implement ε-Greedy Policy (15 points): The choose_action method currently ignores the eps and q_table values and just returns a random action. Instead, implement the ε-greedy policy: with probability eps it does exactly the above (returns a random action), but with probability 1 - eps it picks the action with the highest Q-value (you can get the list of actions and Q-values using q_table[curr_state].items()). Note that in case of multiple actions having the same maximum Q-value, you should randomly pick any one of them. The solution can easily be achieved in 10-15 lines, and probably fewer; see the sketch after this list.
4. Number: Reward From Best Policy (10 points): Given the list of items, the recipes, and the rewards,
what is the maximum possible reward that Odie will get at the end of any episode? Keep in mind that the
inventory can only hold a maximum of three items, and Odie cannot drop any item (they get removed only
when they are part of a recipe). Use this in the optimality test ( is_solution ), and submit the number.
5. Output: Smartest Odie Can Be (15 points): Now your implementation should be complete. Run the agent until it ends with output indicating it has found the best policy. Submit the last three lines (starting with XXX Showing best policy ..., then Found solution, and then Done).
6. Code: Expanding the World (10 points): We will now add more items and recipes to the world. To begin with, add the following items (one red_mushroom and two planks) and recipes (a bowl made from two planks, and a mushroom_stew made from a bowl and a red_mushroom). Also include a reward of 5 for the red_mushroom, −5 for the planks, −1 for the bowl, and 100 for the mushroom_stew. A sketch of what these additions might look like appears after this list.
7. Number: Number of States 2 (5 points): With this expanded list of items and recipes (and assuming the
environment has one of each, including crafted items), how many states do you think are possible now?
Just provide a number.
8. Number: Reward From Best Policy 2 (10 points): Given this expanded list of items, recipes, and rewards,
now what is the maximum possible reward that Odie will get? Use this in the optimality test ( is_solution ),
and submit the number.
9. Output: Running on the Expanded World (10 points): Given this implementation of the expanded world, run the same code as before. Note that the world is much bigger now (in terms of the number of states), so do not be disappointed if your agent does not converge to the optimal policy any time soon (or at all). Run your agent for at least 200 episodes, and submit at least the last 20 lines starting with XXX Showing best policy ... (if on a Unix system, grep and tail will be your friends).
10. Extra Credit: Improving the Agent (10 points): If you are feeling adventurous, you can try running with n > 1, or changing α, γ, and ε, to see if you get better results. You should see whether you can find a policy that is significantly better than the previous policy in the same number of episodes (200). Describe in a line or two what you changed, and submit the last 20 lines that start with XXX Showing best policy ..., as in the previous part. You will be graded on a combination of what improvements you tried and what you achieved.
11. Comments: Any comments about your submission that you want to bring to our attention as we are grading it.
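For item 1, one possible way to build an order-insensitive, indexable state is to return a sorted tuple of item names. This is only a sketch, under the assumption that the inventory is passed in as a list of item-name strings; adapt it to whatever representation hw_dog_submission.py actually uses.

    def get_curr_state(items):
        # Sorting makes the state independent of pick-up order while preserving counts:
        # ('egg', 'egg', 'sugar') and ('egg', 'sugar') are different states.
        return tuple(sorted(items))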
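For item 3, a minimal ε-greedy sketch is shown below. It assumes choose_action receives the current state, the list of possible actions, eps, and the Q-table (a dict of dicts); the exact signature in hw_dog_submission.py may differ. Ties at the maximum Q-value are broken uniformly at random, as the item requires.

    import random

    def choose_action(curr_state, possible_actions, eps, q_table):
        # With probability eps, explore: pick any action uniformly at random.
        if random.random() < eps:
            return random.choice(possible_actions)
        # Otherwise exploit: pick an action with the highest Q-value,
        # breaking ties randomly among the maximizers.
        action_values = list(q_table[curr_state].items())
        best_value = max(value for _, value in action_values)
        best_actions = [action for action, value in action_values if value == best_value]
        return random.choice(best_actions)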
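For item 6, the additions amount to extending the item list, the recipe table, and the reward table. The names items, food_recipes, and rewards_map come from the handout, but the dictionary-style layout below is an assumption about their structure; treat it as a sketch of the required content rather than drop-in code, and match whatever shapes hw_dog.py actually uses.

    # Assumed shapes; adjust to the actual structures in hw_dog.py.
    items += ['red_mushroom', 'planks', 'planks']              # one mushroom, two planks

    food_recipes['bowl'] = ['planks', 'planks']                # bowl from two planks
    food_recipes['mushroom_stew'] = ['bowl', 'red_mushroom']   # stew from bowl + mushroom

    rewards_map['red_mushroom'] = 5
    rewards_map['planks'] = -5
    rewards_map['bowl'] = -1
    rewards_map['mushroom_stew'] = 100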