
iKnowit: Building & Validating a Contextual Database of Object Definitions for Future Service Robots Through an Interactive Game

Abhijeet Uike abhijeetuike@ufl.edu Narayana Perla perlanarayana@ufl.edu Saran Kumar Vellanki sarankvellanki@ufl.edu

Department of Electrical & Computer Engineering, University of Florida, Gainesville, FL 32608, USA
Abstract—Insufficient and computation-intensive object recognition has hindered the realization of service robots, despite the fact that service robots, unlike many other application areas of the object recognition field, can exploit the benefits of an active recognition approach. This paper builds upon an already established theoretical model and extends it into an ecosystem that can be deployed to make object recognition easier and faster for service robots in households. A quick glance at our surroundings can tell us all about the objects lying around in a matter of seconds. However, similar recognition of the objects around a robot in a room requires capturing scores of separate picture frames and applying heavy image processing algorithms to identify each individual item. This is computation intensive and can be reduced while achieving better results. We propose an extended ecosystem that stands upon a database of contextual definitions of objects organized in local and global contexts, a mechanism for their continuous in-field updating, and extra hardware (beyond a 2D camera) to read the derived attributes defined in the database. The interdependence of these parts of the system is essential in realizing the ecosystem. In order to demonstrate the usefulness of the proposed database, we have implemented a voice-recognition-based interactive game named iKnowit that asks the player a series of questions in an intelligent manner to guess the object the player is thinking about. The idea is that the answers to these questions, which currently come from the player in the game, will eventually come from the sensors, processing units, and definitive database clouds of the robot, which will then be able to recognize objects without human help. Additionally, the ability to interact with humans using speech alone will enable an ordinary user without any programming background to aid in the crowd-sourced maintenance and improvement of the dynamic definitive database. Our database aims at mapping each object in a human setting to corresponding human activities.

Keywords—Artificial Intelligence; Attributes; Contextual Database; Crowdsourcing; Object Recognition; Service Robots; Voice Recognition


I. INTRODUCTION

Robots took over modern car assembly lines and factory floors long ago [1][2] (Figure 1), but service robots in our households that can understand a variety of different objects, and how humans interact with them, have not been realized yet. Commercially available service robots such as iRobot's Roomba and Create exist, but they hardly tackle object recognition. The main reason is that although robots operate impeccably through a fixed and predictable environment (like an assembly line), they are not capable of dealing with dynamic environments (like our kitchens), as shown in Figure 2. To address that, robots must recognize a wide variety of objects with unpredictable positions and patterns in the real world [3].

Figure 1. Robots in industrial settings are almost a standard now [2]

Figure 2. Identifying a bottle

Often, the challenges in real-world computing boil down to finding the right trade-off between a look-up-table approach and a computational-blocks approach. Object recognition is one of those classic problems: the look-up-table approach means the use of a database, and the computational blocks are the image processing modules. Over the last few decades, a lot of research resources have been invested in improving the latter; we believe the former needs to improve significantly and that a proper hybrid of the two is the solution to this problem. Major challenges in object recognition have often been attributed to the limitations of algorithms and models for image processing of 2D pictures in a passive setting [11]. Problems like occlusions, skewed viewpoints, asymmetric illumination, a lack of comprehensive object classes/models, and computation-intensive image processing techniques are as old as the study of the field itself. Addressing them entirely seems to have been the sole agenda in object recognition [4]. This paper proposes a system-level approach to object recognition for the service robotic systems of the not-so-distant future.

The challenges involved in the conventional approach include high computing requirements due to the image processing load, a large sample space of possible fits for a given object, and incomplete definitions of objects. Current systems translate the problem of object recognition into one of image processing alone and attribute failures of recognition to the limitations of those techniques. A conventional approach to identifying objects in a given setting is depicted in Figure 3.


Figure 3. Conventional approach to Object Recognition

We have built a database that is like a dictionary of objects but more understandable for machines than the conventional dictionaries available. For example, the Merriam-Webster dictionary defines a chair as "a seat typically having four legs and a back for one person." This definition is for humans and is of no help to machines; recognition according to this for-humans-only definition can go wrong, as Figure 4 shows.

Figure 4. False recognition due to incomplete (for-humans-only) definitions

A better definition for a machine would be an elevated surface with a vertical surface as a backrest, with dimensions as shown in Figure 5.

Figure 5. Dimensions of the components of a chair
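To make the contrast concrete, a definition of this kind could be stored as a record of bounded dimensions rather than prose. The following is a minimal sketch in Java; the class and field names (ObjectDefinition, DimRange, seatHeight, etc.) and the numeric bounds are our own illustrative assumptions, not the schema of the actual database.

```java
// Hypothetical sketch of a machine-readable object definition.
// All names and bounds are illustrative assumptions, not the paper's schema.
import java.util.HashMap;
import java.util.Map;

class DimRange {
    final double minCm, maxCm;    // approximate bounds in centimeters
    DimRange(double minCm, double maxCm) { this.minCm = minCm; this.maxCm = maxCm; }
    boolean fits(double measuredCm) { return measuredCm >= minCm && measuredCm <= maxCm; }
}

class ObjectDefinition {
    final String className;       // e.g. "chair" (a global class, not a local instance)
    final Map<String, DimRange> dimensions = new HashMap<>();

    ObjectDefinition(String className) { this.className = className; }

    static ObjectDefinition chair() {
        ObjectDefinition d = new ObjectDefinition("chair");
        // An elevated horizontal surface at roughly sitting height...
        d.dimensions.put("seatHeight", new DimRange(38, 55));
        // ...with a vertical surface (backrest) rising above it.
        d.dimensions.put("backrestHeight", new DimRange(30, 60));
        d.dimensions.put("seatWidth", new DimRange(35, 60));
        return d;
    }
}
```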


We have proposed a new scale, HuSca (relative Human Scale), that takes into account various human bodily dimensions, which can be treated as variables with approximate minimum and/or maximum bounds. These variables are in fact responsible for the ergonomics of most, if not all, of the man-made objects around us. There is a reason why washing machines are deeper than a kitchen sink but generally not as tall as a refrigerator. A washing machine maker keeps the depth of the unit less than the average length of a human arm so that one's hand can reach the bottom of it while in use; that keeps the upper bound in check, while the lower bound is governed by its utility. This and a few other aspects of an object were incorporated in the development of our database and will be discussed in detail in the implementation section. We strongly believe that object recognition for a machine is a system-level problem: the definitions and functionalities of many objects around us change over time and need regular updating. This is better addressed with an interactive system capable of incorporating new definitions in the easiest possible manner. Also, solving the classic OR problem at the system level can benefit from some real-world cues, which are discussed in the later part of this section. First, we present our vision of an ecosystem for a service robot to identify the objects in its standard operating environment (SOE), shown in Figure 6.

Figure 6. Proposed components of the extended Object Recognition ecosystem for a Service Robot


Figure 7. Different objects - Similar Shape Outline

A similar shape outline for a carpet roll, an Italian bread, and a metal pipe makes it difficult to tell the objects apart if only a 2D image is used. Our approach is to first use depth-map data to gauge the approximate volume; use prior knowledge of the surroundings (context) based on geographical location (mart, home, or bakery) and already identified objects in the given frame to shorten the sample space of possible objects; then test other identifier attributes for each object in the reduced sample space, taking more information from the scene if required; and obtain cues from the scene if an object is being interacted with. If the object was already recognized before, secondary yet easily identifiable attributes like color or texture can be used to identify parts of the object and thus recognize it. The ecosystem of our vision is divided into the following parts.

A. Global cloud of object definitions
This global cloud holds the database of definitions in the global, overall context, which represents the generic understanding associated with an object. For example, a cellphone is defined as a unit roughly the size of a human palm/fist with a screen, capable of communication and display. Such a definition would not help the local robot identify an iPhone 4 or an HTC Sensation in a household, but once the object is identified, it helps the robot understand what the object does. The huge intra-class variations force ever more generic definitions, which make OR difficult on global definitions alone. A two-phase definition (or class-instance model) of global and local definitions helps solve this.

B. Local cloud of object definitions
This cloud holds definitions of objects for a local context that help the system identify a given object more reliably. For example, the definition of a couch in the global context will not include color or texture, but a couch with black color and leather texture in a given robot's SOE will help it identify that couch easily over and over again. What's more, even if a significant part of the couch is not visible, a guess can be made solely from the secondary attributes defining this object, like color or texture, once the initial identification has been done.

C. Hardware capable of gauging size/volume and depth
Most techniques use 2D images to identify objects, and this has been a major area of study despite some drawbacks. A depth map will help solve a few of these problems, like occlusion. This can be achieved with a stereo camera system or a depth-map (time-of-flight) camera [5] together with a normal 2D camera. This assembly can segregate objects based on their relative position in the captured frame.

D. Voice-based interaction capability with users in the SOE
Because machines will be operating in human settings, it is very useful if they can seek human help in understanding objects better and train with better definitions. This can work seamlessly with voice recognition and understanding, as discussed in later sections.

Some of the key elements from the real world that can be helpful in OR are as follows.

a) Using a local database for repetitive identification
The robot will live in a setting that is a mostly non-changing environment; the scenarios or object placements will vary significantly, but individual object variation will be minimal. This eliminates, or greatly reduces, the need to employ costly image processing repetitively before narrowing the possible fits in the object classes. The robot does not need to refer to a global cloud of model/class definitions to identify the same local objects over and over again. Only the first time will it refer to the global cloud, saving the attributes of the given instance of an object class in the local cloud (see the sketch after this list). For example, a chair in the house can have a different shape from all other chairs identified in the global database/class models. The system need not focus on the models; instead it needs to identify the local instance of that class (the chair in its surroundings) in a better manner, i.e., despite occlusions, changes in position, or occasional deformations. Identifying the given chair in the robot's setting is more important than trying to fit every position/geometry of the chair into the global class called chairs.

b) Gauging volume before applying shape-based models
Current techniques stress the shape/outline of objects and pay less attention to the scale or size of the object with respect to the given setting. There is hardly any difference in the shape outline of a pencil, a punching bag, a pipe, a rolled-up carpet, and a chimney: they are all cylindrical. Spending unnecessary computational resources to distinguish among them using only image processing is inefficient. The sizes of these similarly shaped objects vary greatly, and so do their interactions with humans; as noted earlier, they can be mapped with HuSca to differentiate between them before applying image processing.

c) Crowdsourcing approach to update the database

The system's understanding can be corrected through easy upgrading of definitions by simply asking the right questions of any human user around the mobile robot. A voice recognition and understanding system is very useful here, and we propose to make it part of the ecosystem; it can contribute to, and concurrently benefit from, improvements in object recognition methods.
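A minimal sketch of the local-first lookup described in element (a) above, reusing the ObjectDefinition type from the earlier sketch. The cloud interface and method names are illustrative assumptions; the paper does not specify an API.

```java
// Hypothetical local-first lookup: consult the robot's local cloud of
// instance definitions before falling back to the global class definitions.
// Interface and method names are illustrative assumptions.
import java.util.Optional;

interface DefinitionCloud {
    Optional<ObjectDefinition> lookup(String attributesKey);
    void store(String attributesKey, ObjectDefinition def);
}

class Recognizer {
    private final DefinitionCloud local;   // instance definitions in the robot's SOE
    private final DefinitionCloud global;  // generic class definitions

    Recognizer(DefinitionCloud local, DefinitionCloud global) {
        this.local = local;
        this.global = global;
    }

    Optional<ObjectDefinition> identify(String attributesKey) {
        // Repetitive identification: the local instance is checked first...
        Optional<ObjectDefinition> hit = local.lookup(attributesKey);
        if (hit.isPresent()) return hit;
        // ...and only unseen objects trigger the costlier global lookup,
        // after which the instance is cached locally for next time.
        Optional<ObjectDefinition> cls = global.lookup(attributesKey);
        cls.ifPresent(def -> local.store(attributesKey, def));
        return cls;
    }
}
```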

II. SYSTEM ARCHITECTURE

A. Why build the game and not the system?
The vision of a complete system capable of object recognition presented in this paper is rather like a chicken-and-egg problem: without all the parts in place, it is difficult to demonstrate its complete operation. However, in order to test and receive feedback on the usefulness of the built database, we have developed a voice-based interactive Android game. The system uses the database to respond.

B. Implementation of the game involved the following steps

a) Database creation
It starts with an obvious observation: every object around us is defined by attributes that help it serve its core functionality. Barring objects associated with sports and art, most things about an object have some specific purpose. For example, a water bottle is held in the hand, so its diameter cannot be more than one can grip around with a palm, and its opening cannot be much bigger than the orifice of a human mouth. The diameter can therefore be considered one of the defining attributes, and the opening another. Objects have other attributes too: color, texture, and smell form common attributes for most objects. We classify the attributes of a given object into three types: defining attributes, attributes useful for object-class identification, and attributes useful for object-instance identification. We also map these objects to how they interact with humans on a very preliminary level. With millions of object classes, we cannot afford to rely on image processing algorithms that identify an object solely from its outline. A combination of contexts [6][7], global and local definition clouds, stereo vision [8], and the machine's voice interaction ability will help solve this problem. Figure 11 gives three examples to visualize how a service robot would identify objects.

b) State variable
A state variable is a local attribute of a given object that helps the machine understand some very basic things about the object. These are not limited to the solid or liquid states perceived by humans but are derived states that are more meaningful for machines. For example, many objects can hold other objects: everything from kitchen vessels to cars is in fact a container of some sort, and defining this as part of the state has significant advantages. Some objects exist as collections, like a salad or a curry, and defining their state as collective helps the system understand them better. We have defined around 20 such traits under the state category of an object and have achieved very high segregation rates in our game.

c) HuSca (Human Scale)
As pointed out earlier, the ergonomics of the objects around us are based on the scale of the human body parts that interact with them. We have specified an attribute for such scales. This attribute helps us classify objects into proper categories that can be used for identification; a few examples of such HuSca attributes are palm length and grip diameter (see the sketch below).
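As a rough illustration of how the attribute classification and HuSca bounds described above might be represented, consider the following sketch. The enum values, trait names, and sample bounds are our own illustrative assumptions; the paper defines around 20 state traits but does not enumerate them.

```java
// Hypothetical representation of the three attribute types and a HuSca check.
// Names and the sample bounds are illustrative assumptions.
import java.util.EnumSet;

enum AttributeRole { DEFINING, CLASS_ID, INSTANCE_ID }

enum StateTrait { CONTAINER, COLLECTIVE, DEFORMABLE }   // small sample of ~20 traits

class HuScaAttribute {
    final String name;                  // e.g. "GripDia", "PalmLen"
    final AttributeRole role;
    final double minCm, maxCm;          // approximate human-scale bounds

    HuScaAttribute(String name, AttributeRole role, double minCm, double maxCm) {
        this.name = name; this.role = role; this.minCm = minCm; this.maxCm = maxCm;
    }

    // A measured dimension consistent with this human-scale bound keeps the
    // object in the candidate set; otherwise it is pruned before any image
    // processing is applied.
    boolean consistent(double measuredCm) {
        return measuredCm >= minCm && measuredCm <= maxCm;
    }
}

class HuScaExample {
    public static void main(String[] args) {
        // A water bottle's diameter is a defining attribute bounded by grip diameter.
        HuScaAttribute gripDia =
            new HuScaAttribute("GripDia", AttributeRole.DEFINING, 3, 10);
        EnumSet<StateTrait> bottleState = EnumSet.of(StateTrait.CONTAINER);
        System.out.println(gripDia.consistent(7) + " " + bottleState);
    }
}
```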

Figure 8. Various dimensions of an object on the Human Scale will help robots identify it better

In the following pie chart (Figure 9), each slice shows the reduced sample space to be passed to the image processing algorithms, instead of passing the entire set.
Figure 9. Reduction in sample space using HuSca. Slices correspond to HuSca attributes: GripDia, FingGrip, PalmLen, Vess, ShouDist, FingLen, WaistDia, MouDia, ForeLen, FeetLen, HumanHei, ArmLen, FingDia, FaceBre, HeadDia.

d) Human activities
It is an observation that the ways humans interact with objects can be broadly mapped to very basic categories. This can be useful in reverse-mapping the objects from captured frames, which will most likely involve humans as well. The categories are as follows:
1. Consumption of objects
2. Perception of information from objects
3. Wearable objects
4. Objects that can be used as tools
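A compact way to attach these interaction categories to object records, in the style of the earlier sketches; the enum name and constants are our own illustrative labels for the four categories above, not the paper's schema.

```java
// Hypothetical mapping of the four human-activity categories to object classes.
// Enum name and constants are illustrative labels, not the paper's schema.
import java.util.EnumSet;
import java.util.Map;

enum HumanActivity { CONSUMED, PERCEIVED_FROM, WORN, USED_AS_TOOL }

class ActivityMap {
    // Reverse mapping: an observed activity narrows the candidate object classes.
    static final Map<String, EnumSet<HumanActivity>> BY_CLASS = Map.of(
        "apple",     EnumSet.of(HumanActivity.CONSUMED),
        "newspaper", EnumSet.of(HumanActivity.PERCEIVED_FROM),
        "jacket",    EnumSet.of(HumanActivity.WORN),
        "hammer",    EnumSet.of(HumanActivity.USED_AS_TOOL)
    );
}
```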

Figure 10. Captured images with humans interacting with Objects can offer a set of clues for Object Recognition for the Robot.

Figure 11. Examples of how the recognition system would work

III. DESIGN OF THE GAME

A voice-recognition-based interactive game was developed to demonstrate the capabilities of the database and to receive feedback for improving it. The parts of the game are as follows:
1. Voice recognition module, using the RecognizerIntent [10] class from the Android API (see the sketch below)
2. Cellphone-server connection, using JDBC
3. Database server, using MySQL with SQL as the query language
4. Text-to-voice unit, to present the machine's questions to the user
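For reference, here is a minimal sketch of invoking Android's RecognizerIntent named in part 1 above. The surrounding Activity skeleton and the handleAnswer hook are our own illustrative assumptions (the paper does not publish its source), but ACTION_RECOGNIZE_SPEECH and EXTRA_RESULTS are part of the documented Android API [10].

```java
// Minimal sketch of speech capture via Android's RecognizerIntent [10].
// The Activity skeleton and handleAnswer() hook are illustrative assumptions.
import android.app.Activity;
import android.content.Intent;
import android.speech.RecognizerIntent;
import java.util.ArrayList;

public class GameActivity extends Activity {
    private static final int SPEECH_REQUEST = 1;

    // Ask Android's recognizer to capture one spoken answer from the player.
    void listenForAnswer() {
        Intent intent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
        intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL,
                        RecognizerIntent.LANGUAGE_MODEL_FREE_FORM);
        startActivityForResult(intent, SPEECH_REQUEST);
    }

    @Override
    protected void onActivityResult(int requestCode, int resultCode, Intent data) {
        super.onActivityResult(requestCode, resultCode, data);
        if (requestCode == SPEECH_REQUEST && resultCode == RESULT_OK) {
            // Candidate transcriptions, best match first.
            ArrayList<String> results =
                data.getStringArrayListExtra(RecognizerIntent.EXTRA_RESULTS);
            if (results != null && !results.isEmpty()) handleAnswer(results.get(0));
        }
    }

    void handleAnswer(String answer) { /* game logic omitted */ }
}
```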


Figure 12. Game Interface



Figure 13. Sample Conversation between User (U) and App (A) assuming that the user is thinking of a 'Computer Mouse'.

A user thinks of an object inside the house and commences the game through the app on an Android phone, which contains the voice recognition module. The game system checks the database and asks a question based on the broader categories defined. Questions can be of two types: yes/no or subjective. Based on the user's replies, the game runs a query on the SQL database running at the back end. The query may return more than one object; the candidates are passed to the application running on the phone and read aloud for the user, who can confirm or deny each one. The entire process is reiterated until the system arrives at the right conclusion. The algorithm for the game is represented in the flowchart in Figure 14.

Figure 14. Algorithm used for the game
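To illustrate the query step, here is a minimal JDBC sketch of how a confirmed answer might narrow the candidate set. The table and column names (objects, state_container) are illustrative assumptions; the paper confirms only that MySQL, SQL, and JDBC are used.

```java
// Hypothetical JDBC query narrowing candidates after a yes/no answer.
// Table and column names are illustrative assumptions.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

class CandidateQuery {
    static List<String> candidates(String jdbcUrl, boolean isContainer) throws Exception {
        List<String> names = new ArrayList<>();
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             PreparedStatement ps = conn.prepareStatement(
                 // Each confirmed trait becomes a WHERE clause that shrinks
                 // the sample space before the next question is chosen.
                 "SELECT name FROM objects WHERE state_container = ?")) {
            ps.setBoolean(1, isContainer);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) names.add(rs.getString("name"));
            }
        }
        return names;
    }
}
```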

IV. CONCLUSION

The project successfully demonstrated that the introduction of a contextual database can help make object recognition systems better and less expensive in terms of computation. Such systems will also benefit greatly from the local context, which makes the repetitive identification of objects present in the standard operating environment of the service robot faster. Additionally, the provision for updating the database in the field using a crowdsourcing approach will make the ecosystem complete.

References
[1] Executive Summary of World Robotics 2011 (Industrial & Service) Report, International Federation of Robotics.
[2] "DaimlerChrysler installs new robot based flexible assembly line," Industrial Robot: An International Journal, Emerald, 2008.
[3] M. Treiber, An Introduction to Object Recognition, 2010.
[4] M. Bennamoun and G. Mamic, Object Recognition: Fundamentals & Case Studies, Chapter One, 2002.
[5] S. Helmer and D. Lowe, "Using Stereo for Object Recognition."
[6] A. Torralba, "Contextual Priming for Object Recognition," 2003.
[7] A. Oliva and A. Torralba, "The role of context in Object Recognition," ScienceDirect.
[8] W. Wohlkinger and M. Vincze, "3D Object Classification for Mobile Robots in Home Environments using Web Data," 19th Int'l Workshop on Robotics in Alpe-Adria-Danube Region (RAAD 2010), June 23-25, 2010, Budapest, Hungary.
[9] L. von Ahn, R. Liu, and M. Blum, "Peekaboom: a game for locating objects in images," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '06), 2006.
[10] http://developer.android.com/reference/android/speech/RecognizerIntent.html
[11] A. Selinger and R. Nelson, "Appearance-Based Object Recognition Using Multiple Views," 2001.
