
16th IEEE International Conference on Robot & Human Interactive Communication, August 26-29, 2007, Jeju, Korea

An Object Recognition Scheme Based on Visual Descriptors for a Smart Home Environment
Seung-Ho Baeg, Jae-Han Park, Jaehan Koh, Kyung-Wook Park, Moon-Hong Baeg
Control and Perception Research Group, Korea Institute of Industrial Technology (KITECH), 1271 Sa 1-dong, Sangrok-gu, Ansan, 427-791, South Korea (ROK). E-mail: (shbaeg, hans1024, jaehanko, kwpark, mhbaeg)@kitech.re.kr
Abstract- One of the functionalities a service robot needs in order to work in a smart environment is the recognition and handling of objects. Many researchers have attempted to make service robots recognize objects in natural environments, but no conventional vision system can recognize target objects in complex scenes. We built a prototype smart environment in our research building at the Korea Institute of Industrial Technology (KITECH) to demonstrate that a service robot with few sensors can provide reliable services by interacting with an environment full of smart devices through wireless sensor network communications. In this paper, we address the issues involved in developing an object recognition system for our RoboMaidHome project. Our object recognition system not only recognizes objects without a previous training stage but also keeps the service robot light-weight, with minimal hardware and software requirements, while working well in a smart home environment. In addition, the matching process for object recognition is simple since only a few image processing techniques are employed. Our object recognition scheme, based on MPEG-7 visual descriptors covering color, texture, and shape, is under development and will be incorporated into our mobile service robotics system.

Index Terms- MPEG-7 Descriptor, Visual Descriptor, Object Recognition, Smart Environment, Ubiquitous Computing.

I. INTRODUCTION

The problems of object recognition and handling, along with localization and navigation, have been regarded as major challenges for robotics researchers. These functionalities have only been implemented in restricted environments and have not given satisfactory performance in natural environments. To provide reliable services in a natural environment, service robots have been outfitted with many expensive sensors and actuators. As an alternative to this approach, we initiated a smart home environment project, RoboMaidHome, in which the environment is full of inexpensive smart items with radio frequency identification (RFID) tags as well as smart devices with RFID tag readers, in order to build an environment for low-cost service robots. In this environment, the service robots do not need to carry many sensors or to perform computationally expensive operations. Instead, most of the tasks of the service robots are performed in collaboration with the environment. To provide reliable services, the smart devices and service robots in the environment are connected through wireless communications. We call this sensor network gateway the home server; it is currently under development.

One thing the home server cannot provide is the precise location of objects on a smart device, due to a characteristic of RFID tags: they cannot provide directional information. Therefore, if the task is to grab an object, the service robot must calculate its exact location by recognizing the object through robot vision. However, vision-based recognition is not only computationally expensive but also highly dependent on the scale, rotation, and translation of the images captured from the camera. As the recognition process should be carried out in real-time for low-cost service robots to respond to and perform tasks in collaboration with the environment, we devised an object recognition system based on MPEG-7 visual descriptors for our project. The visual descriptors are designed to describe audiovisual data content in multimedia environments, and they have been used to search, identify, filter, and browse audiovisual content [1]. One advantage of using MPEG-7 descriptors is efficiency. To be specific, MPEG-7 visual descriptors provide good clues for locating objects in a visual field, and the recognition process with descriptors can be conducted in real-time, thereby guaranteeing robustness. The information about objects is expressed in the extensible markup language (XML) format, and our object recognition module works on the basis of MPEG-7 visual descriptors. Our approach has the following advantages. First, our system does not require a training phase for object recognition and localization, since we receive product information from the database on the object information server through wireless communications. Therefore, a brand-new product can be recognized on the basis of its RFID information with no problem. RFID tag readers on smart tables and smart shelves detect whether objects are on them and send the scanned data from all objects to the environment through sensor networks. Second, the service robot does not have to maintain the object database itself. Instead, the object information server in the environment maintains the database of products. On the basis of the data read from RFID tags and the wireless communication between smart objects and the environment, the information about the object being searched for can be found and transmitted to the service robot. Thus, the size of the system can be minimized. This can lead to a cost reduction of the robotic system, which is important for the future service robot industry. Third, the matching process for the objects being searched for can be simple because our object recognition system uses MPEG-7 visual descriptors, specifically, color, texture, and shape.


The service robot can download the information about the objects, in the XML format, from the OIS, which maintains the information on all products including RFID-related data, and then finds the matching objects. Since this does not require computationally expensive image processing, we hope the scheme can be used in real-time housekeeping tasks. The structure of this paper is as follows. In Section 2, we review related work, with special interest in object recognition using MPEG-7 visual descriptors. The overall object recognition system is then described in more detail in Section 3. Experimental results are briefly presented in Section 4. Lastly, we give conclusions and the future direction of our system in Section 5.
II. RELATED WORK

Many researchers in the field of service robotics have tried to build autonomous mobile service robot systems that serve as personal assistants. Such robotic systems are expected to interact with people in natural environments such as private houses, hospitals, day-care facilities, and museums. In these environments, much attention has been given to the means of interface between humans and robots. As a way to promote this interaction capability, multimodal interfaces have been proposed. McGuire et al. [2] integrated active vision, gestural instruction, and speech input into a robot system for grasping tasks. To be specific, the speech processing module and the attention module produce linguistic and visual/gestural inputs, which are fed into an integration module. Finally, the output from the integration module is passed to the manipulator in charge of motion and grasping. The authors intended to enable the robot to communicate with the user in a natural fashion. Nakamura et al. [3,4,5] developed a service robot system that accomplishes tasks given by a user. For the service robot to bring the objects asked for by the user, they employed a speech-based interface. In addition, they combined it with a vision-based interface to recognize gestures of the user. Recently, they have focused on a cooperative vision-speech system to recognize objects in complex scenes. Takahashi et al. [6] proposed a human-robot interface method to enhance the capability of robot vision on the basis of communication by verbal and nonverbal behaviors. When the robot is given the verbal command, "bring me that apple," while the user is pointing at an apple, the vision system of the robot attempts to find the apple. If the robot extracts multiple object candidates, it asks the user to choose the correct one among them via speech. This type of interaction is repeated until the user's command is fulfilled. Object recognition and grasping are common tasks for service robots. Makihara et al. [7] proposed a method to recognize an object from any direction. By registering object models, it starts recognizing target objects. Then, if the robot fails to recognize the object, it tries again with user interaction via speech. Hans et al. [8] introduced the second prototype of Care-O-bot, a mobile service robot that has the capability to perform fetch-and-carry tasks. For simple man-machine communication, speech, haptics, and gestures are considered in the interface design. They also introduced the tasks a robotic home assistant should perform, including household tasks, mobility aid, communication, and social integration. Zobel et al. [9] combined vision and speech to improve the capabilities of their autonomous service robot system. MOBSY acts as a mobile receptionist for visitors. It waits at its home position. When a visitor arrives, MOBSY approaches them while introducing itself. After stopping in front of the visitor, it starts a natural-language-based dialogue. When the dialogue is over, MOBSY turns and returns to its initial position. The above approaches focus on providing users with simple and specific services via human-computer interaction and require a robotic system with sophisticated equipment. Since we have built a smart environment for service robots that provides services through the communication between users and service robots, we need a new object recognition scheme for our environment.
III. OBJECT RECOGNITION SCHEME FOR THE ROBOMAIDHOME PROJECT

Our scheme starts with the following scenario. When a new product is launched, the manufacturer registers the properties of the product by using the annotation tool, visiTag. The annotation tool not only automatically generates visual descriptor information for the image of the product in accordance with the MPEG-7 specification, but also records the descriptor information in the XML format in the manufacturer's database. When a robot is given a command such as to fetch a can of coke, it talks to the environment through TCP/IP communications to request the object information associated with the RFID code. Once the information about the object is received, the robot calculates the exact location of the coke can by vision processing. Fig. 1 shows the overall system architecture of our vision system. The vision system consists of the following three components: (i) the annotation tool; (ii) the object naming server (ONS) and object information server (OIS); and (iii) the object recognition system (ORS). The annotation tool, visiTag, is an RFID-based visual information extraction and registration tool. It extracts the visual descriptor information of each object and the image of the object, and then stores them in the object description database (ODDB) and the object image database (OIDB), respectively.

The OIS is connected to its OIDB and ODDB. The ODDB maintains the descriptor information as well as tracking information for products in the XML format, while the OIDB keeps image data for the products. In our architecture, each company has its own OIS. Whenever a new product is launched, data about the product are inserted and the ODDB and OIDB are updated.
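As a rough sketch of the registration step performed by visiTag, the following Python fragment builds a minimal object description record and serializes it to XML. The element names (descriptionDocument, staticData, dominantColorDescriptor, and so on) and the helper build_object_description are hypothetical placeholders rather than the actual GODScheme fields; the descriptor values are assumed to come from visiTag's extraction stage.

    import xml.etree.ElementTree as ET

    def build_object_description(rfid_code, product_name, manufacturer,
                                 dominant_colors, edge_histogram):
        # Build a minimal MPEG-7-style object description document.  All element
        # names are illustrative placeholders; the real GODScheme used by visiTag
        # defines its own schema.
        root = ET.Element("descriptionDocument")
        static = ET.SubElement(root, "staticData")
        ET.SubElement(static, "RFIDcode").text = rfid_code
        ET.SubElement(static, "productName").text = product_name
        ET.SubElement(static, "manufacturer").text = manufacturer

        visual = ET.SubElement(root, "visualDescriptors")
        dcd = ET.SubElement(visual, "dominantColorDescriptor")
        for r, g, b, percentage in dominant_colors:
            color = ET.SubElement(dcd, "color", percentage=str(percentage))
            color.text = "%d %d %d" % (r, g, b)
        ehd = ET.SubElement(visual, "edgeHistogramDescriptor")
        ehd.text = " ".join(str(v) for v in edge_histogram)
        return ET.tostring(root, encoding="unicode")

    # Example: register a hypothetical canned drink with two dominant colors
    # and an 80-bin edge histogram (zeros used as dummy values).
    xml_record = build_object_description(
        "urn:rfid:example-0001", "Coke 250ml", "CompanyA",
        dominant_colors=[(200, 30, 40, 0.7), (255, 255, 255, 0.3)],
        edge_histogram=[0.0] * 80)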


Fig. 1. The system architecture of the vision system for RoboMaidHome

The object recognition system for the RoboMaidHome project is crucial for a service robot to provide reliable services, since most of the services are carried out in an environment full of RFID-tagged objects. Many expensive tasks such as map building, navigation, object recognition, and object handling are conducted through social interactions between the robot and the house. For example, when localizing itself, the robot interacts with location sensors in the environment, and when identifying intruders, it talks with smart security sensors.
As the need for context-aware mobile robots that perceive environmental signals has increased, and the robot must infer the context from those signals and take appropriate actions, it must normally be outfitted with various sensors to capture the signals related to the state of the environment. This often leads to a complex and expensive robotic system. In contrast to such conventional robotic systems, our robotic system for the smart environment is quite light, since it is only equipped with a camera, an RFID reader, and a communication module to accomplish the given tasks. Let us consider a realistic scenario in detail based on Fig. 2. Suppose the service robot is asked to fetch a can of coke on the smart table in our smart home environment. When the robot approaches the smart table, it detects objects (i.e., the coke can) within its read range. After it gets the RFID code, it sends the ONS a query for the IP address of the appropriate OIS. This process is required because each OIS maintains its own product information and the ONS keeps the information about how to access all OISs. In our scenario, this information is the internet protocol (IP) address of the OIS. After the robot gets the server response, it establishes a connection to the appropriate OIS and receives the descriptor information of the matching object from the database management system attached to the OIS. When sending a request to the ONS and the OIS, the RFID code of the object is used as the keyword for retrieval. After receiving the visual descriptor data in the XML format, the object recognition process of the service robot is performed. The whole process is depicted in Fig. 2.
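A minimal sketch of this lookup sequence (steps (3)-(6) in Fig. 2) is given below. The host name, port numbers, and the line-oriented request strings (LOOKUP, DESCRIBE) are assumptions made for illustration; the actual RoboMaidHome servers define their own message formats over TCP/IP.

    import socket

    ONS_ADDR = ("ons.robomaidhome.local", 5000)      # hypothetical ONS host and port

    def query_ons(rfid_code):
        # Ask the ONS which OIS holds the product data for this RFID code (steps 3-4).
        with socket.create_connection(ONS_ADDR, timeout=2.0) as sock:
            sock.sendall(("LOOKUP %s\n" % rfid_code).encode())
            # The ONS is assumed to reply with one line holding the OIS IP address.
            return sock.makefile().readline().strip()

    def query_ois(ois_ip, rfid_code, port=5001):
        # Download the XML descriptor document for the object from the OIS (steps 5-6).
        with socket.create_connection((ois_ip, port), timeout=2.0) as sock:
            sock.sendall(("DESCRIBE %s\n" % rfid_code).encode())
            return sock.makefile().read()            # XML descriptor data as text

    rfid_code = "urn:rfid:example-0001"              # read by the robot's RFID reader (steps 1-2)
    ois_ip = query_ons(rfid_code)
    descriptor_xml = query_ois(ois_ip, rfid_code)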

Fig. 2. Task flow diagram for grasping a coke can: (1) the robot searches for an object with its RFID reader; (2) the robot reads the RFID data; (3) the robot sends a request for recognition information; (4) the ONS returns the IP address of the OIS where the RFID code is located; (5) the robot sends a request for object information; (6) the proper OIS returns the object information.


A. visiTag: The Annotation Tool

visiTag is an annotation tool for extracting and registering visual descriptor data in accordance with the MPEG-7 specification. A snapshot of this tool is shown in Fig. 3. Some fields, such as product name, manufacturer, and weight, are entered manually, and the others, including the dominant color descriptor and the texture descriptor, are generated automatically. The generated data are stored in the ODDB in the XML format. An example of an object description in the XML format is illustrated in Fig. 4.
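The MPEG-7 standard specifies its own extraction procedure for the dominant color descriptor; purely to illustrate the kind of computation visiTag automates, the sketch below approximates the descriptor's color/percentage pairs by k-means quantization of the image pixels (using scikit-learn). The variance and spatial-coherency components of the real DCD are omitted.

    import numpy as np
    from sklearn.cluster import KMeans

    def dominant_colors(image, n_colors=4):
        # Approximate MPEG-7 dominant-color pairs (RGB centroid, percentage) for an
        # H x W x 3 uint8 image.  Illustration only: the real DCD extraction uses
        # its own clustering and also records color variance and spatial coherency.
        pixels = image.reshape(-1, 3).astype(np.float32)
        km = KMeans(n_clusters=n_colors, n_init=4, random_state=0).fit(pixels)
        counts = np.bincount(km.labels_, minlength=n_colors)
        percentages = counts / counts.sum()
        return [(center.round().astype(int).tolist(), float(pct))
                for center, pct in zip(km.cluster_centers_, percentages)]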

Fig. 3 visiTag: An RFID-based object annotation tool


The main window of visiTag consists of five sections: the object image frame section, the file folder section, the annotator information section, the static data section, and the MPEG-7 visual features section. Data used by visiTag are classified into three categories: static data, instance data, and historical data. Static data indicate fixed or rarely changed data such as manufacturer, weight, etc. Instance data are object-specific data describing individual objects, such as the manufacture date and color data. Historical data are tracking and management data, such as the RFID reader identification number (indicating which reader read the item most recently) and the timestamp (at which the item was most recently read by an RFID reader).
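A compact way to picture these three categories is sketched below; the field names are illustrative examples, not the exact GODScheme columns.

    from dataclasses import dataclass
    from datetime import datetime
    from typing import List, Tuple

    # Illustrative grouping of the three data categories handled by visiTag.

    @dataclass
    class StaticData:                         # fixed or rarely changed properties
        rfid_code: str
        product_name: str
        manufacturer: str
        weight_g: float

    @dataclass
    class InstanceData:                       # object-specific properties
        manufacture_date: str
        dominant_colors: List[Tuple[int, int, int, float]]   # (R, G, B, percentage)
        edge_histogram: List[float]                           # e.g. 80-bin EHD values

    @dataclass
    class HistoricalData:                     # tracking and management data
        reader_id: str                        # the RFID reader that last read the item
        timestamp: datetime                   # when the item was last read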
<?xml version="1.0"?>
<descriptionDocument>
  <accessInfo>
    <repositoryType>JDBC</repositoryType>
  </accessInfo>
  <DBInfo type="static | instance | historical">
    <column>
      <name>RFIDcode</name>
      <key>PK</key>
      <type>varchar</type>
      <length>50</length>
      <fieldName>RFID code URN</fieldName>
    </column>
    <column>
      <name>productName</name>
      <type>varchar</type>
      <length>30</length>
      <fieldName>Name of the Product</fieldName>
      <!-- Additional information goes here. -->
    </column>
  </DBInfo>
</descriptionDocument>

Fig. 4. An example of object description information in the XML format stored in the OIS

B. Object Naming Server (ONS) and Object Information Server (OIS)

The ONS provides a means of accessing the appropriate OIS by keeping a list of the IP addresses of all OISs. When a robot needs to download the information about detected RFID-tagged objects, it first connects to the dedicated ONS. The ONS then searches its list with the RFID code and returns the IP address of the OIS that matches the RFID code. By separating the ONS from the OIS, the load on the server is reduced. In addition, the server can manage authorization and access control by granting or denying access to the associated database. The OIS is in charge of connecting to and managing the associated ODDB and OIDB. Since any incoming request can reach the ODDB and the OIDB only through this server, it can protect the system from malicious attacks by enforcing security requirements. The ODDB also maintains the schema information of the Generic Object Description Scheme (GODScheme) database. As shown in Fig. 5, the schema consists of six distinct tables. The primary key for all three data tables is the RFID code, so if the system knows the RFID code, it can download the descriptor information for object recognition. Since the system maintains visual descriptors such as the dominant color descriptor (DCD), the edge histogram descriptor (EHD), and the curvature scale space (CSS) descriptor, tables for these data are included. The primary keys for the color, shape, and texture tables are not shown in Fig. 5; data in these tables are accessed via foreign key references. Although the current version of the object recognition system does not use any MPEG-7 shape descriptors, the shape descriptor table is created for future implementation.

Fig. 5. An entity-relationship diagram of the Generic Object Description Scheme (PK is short for the primary key of each table)

C. Object Recognition System

Fig. 6 shows the overall recognition system for our project. On the basis of the images captured from the camera and the RFID signals from the RFID reader, the robot queries the ONS, which returns the network address for the RFID code, and then accesses the OIS to download the object data inserted by the manufacturer of the product. The captured images, combined with the visual descriptor data, are fed into the object recognition system. Candidate rectangles are extracted on the basis of the dominant colors specified in the ODDB. Then rectangle removal processing and a color-based matching process are carried out. Finally, the EHD data are generated and similarity matching is performed. Regions of interest (ROIs) for grasping an object are the final result.
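A simplified sketch of this matching stage is given below, assuming the dominant colors and the reference edge histogram have already been parsed from the downloaded XML and that candidate rectangles have been derived from the color mask (e.g., by grouping connected regions, a step omitted here). The thresholds, the orientation-histogram stand-in for the EHD, and the L1 similarity measure are illustrative choices, not the exact procedures used in the system.

    import numpy as np

    def color_mask(image, dominant_rgb, tol=30):
        # Pixels within tol of a dominant color taken from the ODDB (illustrative threshold).
        diff = np.abs(image.astype(np.int16) - np.asarray(dominant_rgb, dtype=np.int16))
        return diff.max(axis=2) <= tol

    def edge_histogram(gray, grid=4, bins=5):
        # Crude stand-in for the MPEG-7 EHD: for each cell of a grid x grid partition,
        # build a magnitude-weighted histogram of quantized gradient orientations
        # (grid*grid*bins = 80 values by default).  The real EHD classifies 2x2 pixel
        # blocks into five edge types; this simplification only illustrates matching.
        gray = gray.astype(np.float32)
        gy, gx = np.gradient(gray)
        mag = np.hypot(gx, gy)
        ang = np.mod(np.arctan2(gy, gx), np.pi)      # orientation folded into [0, pi)
        h, w = gray.shape
        feats = []
        for i in range(grid):
            for j in range(grid):
                cell = (slice(i * h // grid, (i + 1) * h // grid),
                        slice(j * w // grid, (j + 1) * w // grid))
                hist, _ = np.histogram(ang[cell], bins=bins, range=(0.0, np.pi),
                                       weights=mag[cell])
                feats.extend(hist / (hist.sum() + 1e-6))
        return np.asarray(feats)

    def descriptor_distance(ehd_a, ehd_b):
        # L1 distance between two edge-histogram vectors (smaller = more similar).
        return float(np.abs(np.asarray(ehd_a) - np.asarray(ehd_b)).sum())

    def best_roi(image, candidate_boxes, reference_ehd):
        # Among candidate rectangles (x, y, w, h), pick the one whose edge histogram
        # is closest to the reference descriptor downloaded from the OIS.
        gray = image.mean(axis=2)
        scored = [(descriptor_distance(edge_histogram(gray[y:y + h, x:x + w]),
                                       reference_ehd), (x, y, w, h))
                  for (x, y, w, h) in candidate_boxes]
        return min(scored)[1]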
IV. EXPERIMENTAL RESULTS
We have developed the object recognition system for our smart environment. As target objects, we use the six canned beverages shown in Fig. 7: Beautiful, Ceylon Tea, Gatorade, LetsBe, Mango, and Pepsi. We used the two DCDs for color information and the EHD for texture information. Color-based recognition results are compared against those using both color and texture.



Fig. 6 The object recognition system on the basis of visual descriptors

Fig. 7. Objects for the experiments: Beautiful, Ceylon Tea, Gatorade, LetsBe, Mango, and Pepsi (from left to right)
After extracting candidate regions based on dominant colors, texture matching is performed. In the end, the ROI selection is performed. Experiments show that the recognition rate is enhanced considerably when color information (DCD) is supplemented by texture information (EHD) as shown in Fig. 8. For all objects when the distance between the camera and the object is 50 cm, the recognition rate is enhanced by 242% on average. Performance is also evaluated in terms of execution time since the recognition system should run in real-time. For an RFID reader to detect an object, it should be within the reader's read range. Considering a frame resolution of 320 x 240 pixels and the read range, we assume that RFID readers can detect RFID-tagged objects when they are within 2 meters. Thus the following four distinct distances were used for the experiments: 50 cm, 100 cm, 150 cm, and 200 cm. The average execution time for color-based ROI selections was 68 milliseconds as shown in Fig. 9. This figure shows that our scheme is fast enough to be used for our service robotic system in the smart environment. A recognition result when both DCD and EHD are used is shown in Fig. 10. In the example, the target object is the can containing the drink Beautiful.
V. CONCLUSIONS AND FUTURE WORK
In this paper, we proposed an object recognition scheme for our smart home environment, built in the research building of KITECH in Ansan, South Korea, on the basis of MPEG-7 visual descriptors, specifically the DCD and the EHD. The DCD is well suited to locating the ROI based on the dominant colors specified in the XML format stored in the ODDB. The EHD captures the spatial distribution of five edge types in each local area called a sub-image. By combining the two descriptors along with RFID signals, vision-based object recognition is enhanced. Experimental results show that the proposed scheme is not only fast enough to be used in real-time but is also robust under varying lighting conditions. However, to improve the overall performance of the object recognition system, other visual descriptors such as CSS need to be included. In addition, adaptive learning algorithms that work in different environmental settings need to be added, and fine-tuning of the parameters is required to enhance the performance of our method. This scheme will be integrated into the object recognition system of the service robots for our prototype home environment project, RoboMaidHome.

Fig. 8. Recognition rate of the target objects: DCD only versus DCD+EHD, with the light off and at a distance of 50 cm

Fig. 9. Execution time (ms) of the DCD-based object recognition at distances of 50, 100, 150, and 200 cm

Fig. 10 The recognized object on the basis of the DCD and the EHD


VI. REFERENCES
[1] B. S. Manjunath, P. Salembier, et al., "Introduction to MPEG-7: Multimedia Content Description Interface," John Wiley & Sons, 2002.
[2] P. McGuire, J. Fritsch, J. J. Steil, F. Roothling, G. A. Fink, S. Wachsmuth, G. Sagerer, and H. Ritter, "Multi-Modal Human-Machine Communication for Instructing Robot Grasping Tasks," IROS 2002, pp. 1082-1089, 2002.
[3] M. Yoshizaki, Y. Kuno, and A. Nakamura, "Mutual Assistance between Speech and Vision for Human-Robot Interface," IROS 2002, pp. 1308-1313, 2002.
[4] M. Yoshizaki, A. Nakamura, and Y. Kuno, "Vision-Speech System Adapting to the User and Environment for Service Robots," IROS 2003, pp. 1290-1295, 2003.
[5] R. Kurnia, S. A. Hossain, A. Nakamura, and Y. Kuno, "Object Recognition through Human-Robot Interaction by Speech," Proceedings of the 2004 IEEE International Workshop on Robot and Human Interactive Communication, pp. 619-624, 2004.
[6] T. Takahashi, S. Nakanishi, Y. Kuno, and Y. Shirai, "Human-Robot Interface by Verbal and Nonverbal Behaviors," IROS 1998, pp. 924-929, 1998.
[7] Y. Makihara, M. Takizawa, Y. Shirai, J. Miura, and N. Shimada, "Object Recognition Supported by User Interaction for Service Robots," Proceedings of the 16th International Conference on Pattern Recognition (ICPR'02), vol. 3, 2002.
[8] M. Hans, B. Graf, and R. D. Schraft, "Robotic Home Assistant Care-O-bot: Past - Present - Future," Proceedings of the 2002 IEEE Intl. Workshop on Robot and Human Interactive Communication (RO-MAN), pp. 380-385, 2002.
[9] M. Zobel, J. Denzler, B. Heigl, E. Noth, D. Paulus, J. Schmidt, and G. Stemmer, "MOBSY: Integration of vision and dialogue in service robots," Machine Vision and Applications, vol. 14, pp. 26-34, 2003.
