Research Article
A Robust Convolutional Neural Network for 6D Object Pose Estimation from RGB Image with Distance Regularization Voting Loss
Figure 1
The 3D translation and 3D rotation are estimated from 2D-to-3D keypoint correspondences. (b, c) The pixelwise labeling and the pixelwise unit-vector field for keypoint voting, respectively. (d, e) The voting process for finding keypoints, and the calculation of the distances between pixels and keypoints that affect the hypotheses. (f, g) The 2D and 3D keypoint correspondences, related via Perspective-n-Point (PnP); finally, the 6D poses of the objects are estimated (h). In (e), the pixels, the keypoint, the angles between the ground-truth and estimated unit vectors from the pixels to the keypoint, the feet of the perpendiculars, and the distances between the keypoint and the feet of the perpendiculars are illustrated. (a) Input image, (b) pixel labeling, (c) vector field, (d) voting, (e) pixel-to-keypoint distances, (f) 2D keypoints, (g) 3D keypoints, and (h) 6D object pose.
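The voting step in panels (d, e) can be sketched in isolation. The following is a minimal, hypothetical NumPy sketch (the function names `intersect` and `vote_keypoint` and all parameters are illustrative, not the paper's implementation): each pair of sampled pixels casts a keypoint hypothesis where their unit-vector rays intersect, and every hypothesis is scored by how many pixels' predicted unit vectors point toward it within an angular threshold.

```python
import numpy as np

def intersect(p1, v1, p2, v2):
    # Solve p1 + t1*v1 = p2 + t2*v2 for the 2D ray intersection point.
    A = np.stack([v1, -v2], axis=1)          # 2x2 system matrix
    if abs(np.linalg.det(A)) < 1e-8:         # (near-)parallel rays: no hypothesis
        return None
    t = np.linalg.solve(A, p2 - p1)
    return p1 + t[0] * v1

def vote_keypoint(pixels, vectors, n_hyp=64, cos_thresh=0.99, seed=None):
    """RANSAC-style keypoint voting (illustrative sketch).

    pixels:  (N, 2) pixel coordinates belonging to the object.
    vectors: (N, 2) predicted unit vectors pointing from each pixel
             toward the (unknown) 2D keypoint.
    Returns the best-scoring keypoint hypothesis and its inlier count.
    """
    rng = np.random.default_rng(seed)
    best, best_score = None, -1
    for _ in range(n_hyp):
        i, j = rng.choice(len(pixels), size=2, replace=False)
        h = intersect(pixels[i], vectors[i], pixels[j], vectors[j])
        if h is None:
            continue
        # Direction from every pixel to the hypothesis.
        d = h - pixels
        d = d / (np.linalg.norm(d, axis=1, keepdims=True) + 1e-12)
        # A pixel votes for h if its predicted vector agrees within the threshold.
        score = int(np.sum(np.sum(d * vectors, axis=1) > cos_thresh))
        if score > best_score:
            best, best_score = h, score
    return best, best_score
```

With noise-free unit vectors, any non-parallel pair of rays recovers the keypoint exactly and every pixel becomes an inlier; in practice the hypothesis count and angular threshold trade robustness against runtime.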