| Input: User_speeches, User_body_gestures, User_hand_gestures, final_task, M(i, subtask), M(subtask, motion) |
| Initialize: NLP module, Sub_classifier(User_speeches, User_body_gestures, User_hand_gestures), memory M, episode ← 0, network parameters θ, replace_iter |
| Output: Motion_robot |
| While final_task is not finished do: |
| s ← Sub_classifier(User_speeches, User_body_gestures, User_hand_gestures) |
| With probability ε, select a random intention i |
| Otherwise, use equation (1) to calculate i |
| subtask ← M(i, subtask) |
| Motion ← M(subtask, motion) |
| Motion_robot ← Motion − Motion_user |
| r ← NLP(feedback_speech) |
| // s′ is the next behavior feature of the user after the robot executes Motion_robot |
| s′ ← Sub_classifier(User_speeches, User_body_gestures, User_hand_gestures) after the robot executes Motion_robot |
| Calculate reward r_t according to equation (2) |
| Store (s, i, r, s′) in M |
| batch_memory ← random sample from M |
| If s′ marks the end of the collaboration: |
| y′ ← r |
| Else: |
| Use equation (3) to calculate y′ |
| Use equation (4) to calculate the loss |
| Minimize the loss with respect to θ |
| If episode > replace_iter: |
| θ⁻ ← θ |
| End |
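The loop above can be sketched as a standard DQN-style training procedure. This is a minimal, self-contained illustration, not the paper's implementation: the Sub_classifier, the NLP feedback module, the M(i, subtask) and M(subtask, motion) mappings, and equations (1)–(4) are replaced by simple stand-ins (a random feature vector for s, a fixed reward rule, a linear Q-function), so only the control flow — ε-greedy intention selection, replay memory, TD target y′, and the periodic θ⁻ ← θ swap — mirrors the pseudocode.

```python
import random
from collections import deque

import numpy as np

random.seed(0)
np.random.seed(0)

# Hypothetical sizes: behavior-feature dimension and number of intentions.
STATE_DIM, N_INTENTIONS = 4, 3
GAMMA, EPSILON, REPLACE_ITER, BATCH = 0.9, 0.1, 50, 32

# Linear Q-function stand-in for the paper's network: Q(s, .) = s @ theta.
theta = np.zeros((STATE_DIM, N_INTENTIONS))   # online parameters (theta)
theta_target = theta.copy()                   # target parameters (theta minus)
memory = deque(maxlen=1000)                   # replay memory M

def select_intention(s, eps=EPSILON):
    """epsilon-greedy: random intention with probability eps, else argmax Q (stand-in for eq. (1))."""
    if random.random() < eps:
        return random.randrange(N_INTENTIONS)
    return int(np.argmax(s @ theta))

def train_step(lr=0.01):
    """Sample a batch from M and take TD updates (stand-ins for eqs. (3) and (4))."""
    batch = random.sample(list(memory), min(BATCH, len(memory)))
    for s, i, r, s_next, done in batch:
        # y' = r for terminal transitions, else r + gamma * max_a Q_target(s', a)
        y = r if done else r + GAMMA * np.max(s_next @ theta_target)
        td_error = y - (s @ theta)[i]         # squared-TD-error loss; gradient step below
        theta[:, i] += lr * td_error * s

for episode in range(200):
    s = np.random.rand(STATE_DIM)             # stand-in for Sub_classifier output
    for _ in range(10):
        i = select_intention(s)
        s_next = np.random.rand(STATE_DIM)    # stand-in for the next behavior feature s'
        r = 1.0 if i == 0 else 0.0            # stand-in reward (eq. (2)): intention 0 is "correct"
        done = random.random() < 0.1          # stand-in end-of-collaboration signal
        memory.append((s, i, r, s_next, done))
        train_step()
        if done:
            break
        s = s_next
    if episode > REPLACE_ITER and episode % REPLACE_ITER == 0:
        theta_target = theta.copy()           # theta_minus <- theta
```

Because intention 0 is the only rewarded choice in this toy environment, the learned Q-values come to prefer it, which is the behavior the ε-greedy selection and target-network swap are meant to produce.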