Research Article

Learning from Demonstrations and Human Evaluative Feedbacks: Handling Sparsity and Imperfection Using Inverse Reinforcement Learning Approach

Figure 4

(a) The effect of nonoptimality in demonstrations. Experiment setting: initial nonoptimal demonstration “A5” with 60% optimality and 100 demonstrations in the first stage (see Figure 3). (b) The case where all demonstrations in the first stage are optimal but sparse. Experiment setting: initial sparse demonstration “B2” and 20 demonstrations in the first stage (see Figure 3). Two kinds of data are provided during the experiment: evaluative feedbacks related to the policy combination (first horizontal axis), and state-action pairs from extra optimal demonstrations (second horizontal axis).