August 26, WA1-6
An Object Recognition Scheme Based on Visual Descriptors for a Smart Home Environment
Seung-Ho Baeg, Jae-Han Park, Jaehan Koh, Kyung-Wook Park, Moon-Hong Baeg
Control and Perception Research Group, Korea Institute of Industrial Technology (KITECH), 1271 Sa 1-dong, Sangrok-gu, Ansan, 427-791, South Korea (ROK). E-mail: {shbaeg, hans1024, jaehanko, kwpark, mhbaeg}@kitech.re.kr
Abstract- One of the functionalities a service robot needs to work in a smart environment is the recognition and handling of objects. Many researchers have attempted to make service robots recognize objects in natural environments, but no conventional vision system can recognize target objects in complex scenes. We built a prototype smart environment in our research building at the Korea Institute of Industrial Technology (KITECH) to demonstrate that a service robot with few sensors can provide reliable services by interacting, through wireless sensor network communications, with an environment full of smart devices. In this paper, we address the issues involved in developing an object recognition system for our RoboMaidHome project. Our object recognition system not only recognizes objects without a prior training stage but also keeps the service robot light-weight, with minimal hardware and software requirements, while working well in a smart home environment. In addition, the matching process for object recognition is simple, since only a few image processing techniques are employed. Our object recognition scheme, based on MPEG-7 visual descriptors covering color, texture, and shape, is under development and will be incorporated into our mobile service robotics system.
Index Terms- MPEG-7 Descriptor, Visual Descriptor, Object Recognition, Smart Environment, Ubiquitous Computing.
I. INTRODUCTION
The problems of object recognition and handling, along with localization and navigation, have been regarded as major challenges by robotics researchers. These functionalities have been implemented only in restricted environments and have not yet achieved satisfactory performance in natural ones. To provide reliable services in a natural environment, service robots have been outfitted with many expensive sensors and actuators. As an alternative to this approach, we initiated a smart home environment project, RoboMaidHome, full of inexpensive smart items carrying radio frequency identification (RFID) tags as well as smart devices with RFID tag readers, to build an environment for low-cost service robots. In this environment, the service robots do not need many sensors and do not have to perform computationally expensive operations on board. Instead, most of their tasks are performed in collaboration with the environment. To provide reliable services, smart devices and service robots in the environment are connected through wireless communications. We call this sensor network gateway the home server; it is under development.
One thing the home server cannot provide is the precise location of objects on a smart device, because RFID tags cannot provide directional information. Therefore, if the task is to grab an object, the service robot must calculate its exact location by recognizing it through robot vision. However, vision-based recognition is not only computationally expensive but also highly dependent on the scale, rotation, and translation of the images captured from the camera. Since the recognition process should be carried out in real-time for low-cost service robots to respond to and perform tasks in collaboration with the environment, we devised an object recognition system based on MPEG-7 visual descriptors for our project. These descriptors are designed to describe audiovisual data content in multimedia environments and have been used to search, identify, filter, and browse audiovisual content [1]. One advantage of using MPEG-7 descriptors is efficiency: they provide good clues for locating objects in a visual field, and the recognition process with descriptors can be conducted in real-time. The information about objects is expressed in the extensible markup language (XML) format, and our object recognition module works on the basis of MPEG-7 visual descriptors. Our approach has the following advantages. First, our system does not require a training phase for object recognition and localization, since we receive product information from the database on the object information server through wireless communications. Therefore, a brand-new product can be recognized on the basis of its RFID information with no problem. RFID tag readers on smart tables and smart shelves detect whether objects are on them and send the scanned data from all objects to the environment through sensor networks.
Second, the service robot does not have to maintain the object database itself. Instead, the object information server in the environment maintains the database of products. On the basis of the data read from RFID tags and the wireless communication between smart objects and the environment, the information about the object being searched for can be found and transmitted to the service robot. Thus, the size of the on-board system can be minimized. This can lead to a cost reduction for the robotic system, which is important for the future service robot industry. Third, the matching process for the objects being searched for can be simple, because our object recognition system uses MPEG-7 visual descriptors, specifically color, texture, and shape. The service robot can download the information about the objects, in the XML format, from the OIS, which maintains the information on all products including RFID-related data, and then finds the matching objects. Since this does not require computationally expensive image processing, we expect the scheme to be usable in real-time housekeeping tasks. The structure of this paper is as follows. In Section 2, we review related work, with special interest in object recognition using MPEG-7 visual descriptors. The overall object recognition system is then described in more detail in Section 3. Experimental results are briefly presented in Section 4. Lastly, we give a conclusion and future directions for our system in Section 5.
II. RELATED WORK
Many researchers in the field of service robotics have tried to build autonomous mobile service robot systems that serve as personal assistants. Such a robotic system is expected to interact with people in natural environments such as private houses, hospitals, day-care facilities, and museums. In these environments, much attention has been given to the means of interfacing between humans and robots. As a way to promote this interaction capability, multimodal interfaces have been proposed. McGuire et al. [2] integrated active vision, gestural instruction, and speech input into a robot system for grasping tasks. To be specific, the speech processing module and the attention module produce linguistic and visual/gestural inputs, which are fed into an integration module. Finally, the output from the integration module is passed to the manipulator in charge of motion and grasping. The authors intended to enable the robot to communicate with the user in a natural fashion. Nakamura et al. [3,4,5] developed a service robot system that accomplishes tasks given by a user. For the service robot to bring the objects requested by the user, they employed a speech-based interface. In addition, they combined it with a vision-based interface to recognize the user's gestures. Recently, they have focused on a cooperative vision-speech system to recognize objects in complex scenes. Takahashi et al. [6] proposed a human-robot interface method to enhance the capability of robot vision on the basis of communication by verbal and nonverbal behaviors. When the robot is given the verbal command "bring me that apple" while the user is pointing at an apple, the vision system of the robot attempts to find the apple. If the robot extracts multiple object candidates, it asks the user via speech to choose the correct one among them. This type of interaction is repeated until the user's command is fulfilled. Object recognition and grasping are common tasks for service robots. Makihara et al. [7] proposed a method to recognize an object from any direction; it starts recognizing target objects by registering object models, and if the robot fails to recognize the object, it tries again with user interaction via speech. Hans et al. [8] introduced the second prototype of Care-O-bot, a mobile service robot that can perform fetch-and-carry tasks. For simple man-machine communication, speech, haptics, and gestures are considered in the interface design. They also introduced the tasks a robotic home assistant should perform, including household tasks, mobility aid, communication, and social integration. Zobel et al. [9] combined vision and speech to improve the capabilities of their autonomous service robot system. MOBSY acts as a mobile receptionist for visitors. It waits at its home position; when a visitor arrives, MOBSY approaches them while introducing itself. After stopping in front of the visitor, it starts a natural-language-based dialogue. When the dialogue is over, MOBSY turns and returns to its initial position. The above approaches focus on providing users with simple and specific services via human-computer interaction and require a robotic system with sophisticated equipment. Since we have built a smart environment for service robots that provides services through communication between users and service robots, we need a new object recognition scheme for our environment.
III. OBJECT RECOGNITION SCHEME FOR THE ROBOMAIDHOME PROJECT
Our scheme starts with the following scenario: when a new product is launched, the manufacturer registers the properties of the product using the annotation tool, visiTag. The annotation tool not only automatically generates the visual descriptor information of the product image in accordance with the MPEG-7 specification, but also records the descriptor information, in the XML format, in the manufacturer's database. When a robot is given a command, such as fetching a can of coke, it talks to the environment through TCP/IP communications to request the object information for the RFID code. Once the information about the object is received, the robot calculates the exact location of the coke can by vision processing. Fig. 1 shows the overall architecture of our vision system. It consists of the following three components: (i) the annotation tool; (ii) the object naming server (ONS) and object information server (OIS); and (iii) the object recognition system (ORS). The annotation tool, visiTag, is an RFID-based visual information extraction and registration tool. It extracts the visual descriptor information of each object and the image of the object, and then stores them in the object description database (ODDB) and the object image database (OIDB), respectively. The OIS is connected to its OIDB and ODDB. The ODDB maintains the descriptor information as well as tracking information for products in the XML format. The OIDB keeps image data for products. In our architecture, each company has its own OIS. Whenever a new product is launched, data about the product are inserted and the ODDB and OIDB are updated.
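The color-descriptor extraction that visiTag performs can be illustrated with a deliberately simplified stand-in for MPEG-7 DCD extraction (coarse RGB quantization plus frequency counting); the real descriptor uses clustering and stores percentages and variances, so this sketch only conveys the idea:

```python
from collections import Counter

def dominant_colors(pixels, n=3, step=64):
    """Return the n most frequent quantized RGB colors in a pixel list.

    Simplified stand-in for MPEG-7 DCD extraction: each channel is
    quantized into buckets of width `step`, and bucket centers are
    ranked by frequency.
    """
    quantized = [tuple((c // step) * step + step // 2 for c in px) for px in pixels]
    return [color for color, _ in Counter(quantized).most_common(n)]

# Example: a patch that is mostly red with a little white
patch = [(250, 10, 10)] * 7 + [(255, 255, 255)] * 3
print(dominant_colors(patch, n=2))  # [(224, 32, 32), (224, 224, 224)]
```

The dominant colors extracted this way are what gets serialized into the descriptor record for later matching.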
Fig. 1. The system architecture of the vision system for RoboMaidHome

The object recognition system for the RoboMaidHome project is crucial for a service robot to provide reliable services, since most services are carried out in an environment full of RFID-tagged objects. Many expensive tasks such as map building, navigation, object recognition, and object handling are conducted through social interactions between the robot and the house. For example, when localizing itself, the robot interacts with location sensors in the environment, and when identifying intruders, it talks with smart security sensors.
As the need for context-aware mobile robots to perceive environmental signals has increased, and the robot must infer the context from those signals and take appropriate actions, it must be outfitted with various sensors to capture the signals related to the state of the environment. This often leads to a complex and expensive robotic system. In contrast, our robotic system for the smart environment is quite light, since it is only equipped with a camera, an RFID reader, and a communication module to accomplish the given tasks. Let us consider a realistic scenario in detail, based on Fig. 2. Suppose the service robot is asked to fetch a can of coke from the smart table in our smart home environment. When the robot approaches the smart table, it detects objects (i.e., the coke can) within its read range. After it gets the RFID code, it sends the ONS a query for the IP address of the appropriate OIS. This process is required because each OIS maintains its own product information, while the ONS keeps the information about how to access all OISs; in our scenario, that information is the internet protocol (IP) address of the OIS. After the robot gets the server response, it establishes a connection to the appropriate OIS and receives the descriptor information of the matching object from the database management system attached to the OIS. When sending a request to the ONS and the OIS, the RFID code of the object is used as the retrieval keyword. After receiving the visual descriptor data in the XML format, the object recognition process of the service robot is performed. The whole process is depicted in Fig. 2.
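The two-step ONS/OIS lookup above can be sketched as follows. The table contents, the EPC-style RFID code format, and the function names are hypothetical stand-ins for illustration; real lookups would go over TCP/IP rather than in-memory dictionaries:

```python
# Hypothetical ONS: maps a company prefix of an RFID code to the
# address of the OIS that company operates.
ONS_TABLE = {
    "urn:epc:id:sgtin:0614141": "192.168.0.21",
}

# Hypothetical OIS store: (OIS address, full RFID code) -> XML payload.
OIS_DB = {
    ("192.168.0.21", "urn:epc:id:sgtin:0614141.107346.2017"):
        "<descriptionDocument>...</descriptionDocument>",
}

def resolve_ois(rfid_code):
    """Step 1: ask the ONS which OIS holds this product's description."""
    company_prefix = rfid_code.rsplit(".", 2)[0]
    return ONS_TABLE[company_prefix]

def fetch_description(rfid_code):
    """Step 2: contact that OIS and retrieve the descriptor XML,
    using the RFID code as the retrieval keyword."""
    ois_addr = resolve_ois(rfid_code)
    return OIS_DB[(ois_addr, rfid_code)]

print(fetch_description("urn:epc:id:sgtin:0614141.107346.2017"))
```

The robot only ever holds the RFID code; everything else is resolved by the environment, which is what keeps the on-board system small.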
The visiTag interface consists of the object image frame section, the file folder section, the annotation information section, the static data section, and the MPEG-7 visual features section. Data used by visiTag are classified into three categories: static data, instance data, and historical data. Static data are fixed or rarely changed data such as the manufacturer, weight, etc. Instance data are object-specific data describing individual objects, such as the manufacture date and color data. Historical data are tracking and management data, such as the RFID reader identification number (indicating which reader read the item most recently) and the timestamp (at which the item was most recently read by an RFID reader).
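The three data categories might be modeled as in the sketch below; the field names are illustrative guesses based on the examples in the text, not the project's actual schema:

```python
from dataclasses import dataclass

@dataclass
class StaticData:
    """Fixed or rarely changed product data."""
    manufacturer: str
    weight_g: float

@dataclass
class InstanceData:
    """Data describing an individual object."""
    manufacture_date: str
    color: str

@dataclass
class HistoricalData:
    """Tracking and management data for an item."""
    reader_id: str    # RFID reader that last read the item
    timestamp: str    # time of the most recent read
```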
An example object description document in the XML format is shown below:

<?xml version="1.0"?>
<descriptionDocument>
  <accessInfo>
    <repositoryType>JDBC</repositoryType>
  </accessInfo>
  <DBInfo type="static | instance | historical">
    <column>
      <name>RFIDcode</name>
      <key>PK</key>
      <type>varchar</type>
      <length>50</length>
      <fieldName>RFID code URN</fieldName>
    </column>
    <column>
      <name>productName</name>
      <type>varchar</type>
      <length>30</length>
      <fieldName>Name of the Product</fieldName>
      <!-- Additional information goes here. -->
    </column>
  </DBInfo>
</descriptionDocument>

If the system knows the RFID code, it can download the descriptor information for object recognition. Since the system maintains visual descriptors such as the dominant color descriptor (DCD), the edge histogram descriptor (EHD), and the curvature scale space (CSS) descriptor, tables for these data are included. Primary keys for the color, shape, and texture tables are not shown in Fig. 5; data in these tables are accessed via foreign key references. Although the current version of the object recognition system does not use any MPEG-7 shape descriptors, the shape descriptor table is created for a future implementation.

Fig. 5. An entity-relationship diagram of the Generic Object Description Scheme (PK is short for the primary key of each table)
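For illustration, a description document of this shape can be parsed with Python's standard library; this is a sketch against the example above, and the real schema may contain more fields:

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the example description document.
DOC = """<?xml version="1.0"?>
<descriptionDocument>
  <accessInfo><repositoryType>JDBC</repositoryType></accessInfo>
  <DBInfo type="static">
    <column>
      <name>RFIDcode</name><key>PK</key>
      <type>varchar</type><length>50</length>
      <fieldName>RFID code URN</fieldName>
    </column>
    <column>
      <name>productName</name>
      <type>varchar</type><length>30</length>
      <fieldName>Name of the Product</fieldName>
    </column>
  </DBInfo>
</descriptionDocument>"""

def column_specs(doc_xml):
    """Extract (name, type, length) for every column in the document."""
    root = ET.fromstring(doc_xml)
    return [(c.findtext("name"), c.findtext("type"), int(c.findtext("length")))
            for c in root.iter("column")]

print(column_specs(DOC))
# [('RFIDcode', 'varchar', 50), ('productName', 'varchar', 30)]
```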
C. Object Recognition System
Fig. 6 shows the overall recognition system for our project. On the basis of the images captured from the camera and the RFID signals from the RFID reader, the robot accesses the OIS to download the object data inserted by the manufacturer of the product; the ONS first returns the network address for the RFID code. The captured images, combined with the visual descriptor data, are fed into the object recognition system. Candidate rectangles are extracted on the basis of the dominant colors specified in the ODDB. Then rectangle removal and color-based matching are carried out. Finally, the EHD data are generated and similarity matching is performed. The regions of interest (ROIs) for grasping an object are the final results.
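The final EHD matching step can be sketched as an L1 histogram comparison. The 80-bin layout (16 sub-images times 5 edge types) follows the MPEG-7 EHD definition, but the function names and the toy data are illustrative, and the standard's full matching rule also uses derived global and semi-global bins:

```python
def ehd_distance(h1, h2):
    """L1 distance between two MPEG-7-style edge histograms.

    An EHD has 16 sub-images x 5 edge types (vertical, horizontal,
    45-degree, 135-degree, non-directional) = 80 bins; a smaller
    distance means more similar texture.
    """
    assert len(h1) == len(h2) == 80
    return sum(abs(a - b) for a, b in zip(h1, h2))

def best_match(query_hist, candidates):
    """Pick the candidate ROI whose histogram is closest to the
    descriptor downloaded from the OIS."""
    return min(candidates, key=lambda c: ehd_distance(query_hist, c[1]))

# Toy example: two candidate ROIs against a downloaded descriptor
query = [1.0] * 80
rois = [("roi_a", [1.0] * 80), ("roi_b", [0.5] * 80)]
print(best_match(query, rois)[0])  # roi_a matches exactly
```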
IV. EXPERIMENTAL RESULTS
We have developed the object recognition system for our smart environment. As target objects, we used the six canned beverages shown in Fig. 7: Beautiful, Ceylon Tea, Gatorade, LetsBe, Mango, and Pepsi. We used the two DCDs for color information and the EHD for texture information. Color-based recognition results are compared against those obtained using both color and texture.
Fig. 7. Objects for the experiments: Beautiful, Ceylon Tea, Gatorade, LetsBe, Mango, and Pepsi (from left to right)

After extracting candidate regions based on dominant colors, texture matching is performed; in the end, ROI selection is performed. Experiments show that the recognition rate is enhanced considerably when the color information (DCD) is supplemented by the texture information (EHD), as shown in Fig. 8. Over all objects, when the distance between the camera and the object is 50 cm, the recognition rate is enhanced by 242% on average. Performance is also evaluated in terms of execution time, since the recognition system should run in real-time. For an RFID reader to detect an object, the object should be within the reader's read range. Considering a frame resolution of 320 x 240 pixels and the read range, we assume that RFID readers can detect RFID-tagged objects within 2 meters. Thus the following four distances were used for the experiments: 50 cm, 100 cm, 150 cm, and 200 cm. The average execution time for color-based ROI selection was 68 milliseconds, as shown in Fig. 9. This shows that our scheme is fast enough to be used for our service robotic system in the smart environment. A recognition result when both the DCD and the EHD are used is shown in Fig. 10. In the example, the target object is the can containing the drink Beautiful.

Fig. 8. Performance comparison of DCD only against DCD and EHD (recognition rate per object: Beautiful, Ceylon Tea, Gatorade, LetsBe, Mango, Pepsi) when the light is off and the distance is 50 cm

Fig. 9. Execution time (in milliseconds) of color-based ROI selection at distances of 50 cm, 100 cm, 150 cm, and 200 cm

Fig. 10. The recognized object on the basis of the DCD and the EHD

V. CONCLUSIONS AND FUTURE WORK

In this paper, we proposed an object recognition scheme, based on the MPEG-7 visual descriptors DCD and EHD, for our smart home environment built in the research building of KITECH in Ansan, South Korea. The DCD is well suited to locating the ROI based on the dominant colors specified in the XML format stored in the ODDB. The EHD captures the spatial distribution of five edge types in each local area, called a sub-image. By combining the two descriptors with the RFID signals, vision-based object recognition is enhanced. Experimental results show that the proposed scheme is not only fast enough to be used in real-time but also robust under varying lighting conditions. However, to improve the overall performance of the object recognition system, other visual descriptors such as the CSS need to be included. In addition, adaptive learning algorithms are needed for the system to work in different environmental settings, and fine-tuning of the parameters is required to enhance the performance of our method. This scheme will be integrated into the object recognition system of the service robots for our prototype home environment project, RoboMaidHome.
VI. REFERENCES
[1] B. S. Manjunath, P. Salembier, et al., "Introduction to MPEG-7: Multimedia Content Description Interface," John Wiley & Sons, 2002.
[2] P. McGuire, J. Fritsch, J. J. Steil, F. Röthling, G. A. Fink, S. Wachsmuth, G. Sagerer, H. Ritter, "Multi-Modal Human-Machine Communication for Instructing Robot Grasping Tasks," IROS 2002, pp. 1082-1089, 2002.
[3] M. Yoshizaki, Y. Kuno, A. Nakamura, "Mutual Assistance between Speech and Vision for Human-Robot Interface," IROS 2002, pp. 1308-1313, 2002.
[4] M. Yoshizaki, A. Nakamura, Y. Kuno, "Vision-Speech System Adapting to the User and Environment for Service Robots," IROS 2003, pp. 1290-1295, 2003.
[5] R. Kurnia, S. A. Hossain, A. Nakamura, Y. Kuno, "Object Recognition through Human-Robot Interaction by Speech," Proc. 2004 IEEE Int. Workshop on Robot and Human Interactive Communication (RO-MAN), pp. 619-624, 2004.
[6] T. Takahashi, S. Nakanishi, Y. Kuno, Y. Shirai, "Human-Robot Interface by Verbal and Nonverbal Behaviors," IROS 1998, pp. 924-929, 1998.
[7] Y. Makihara, M. Takizawa, Y. Shirai, J. Miura, N. Shimada, "Object Recognition Supported by User Interaction for Service Robots," Proc. 16th Int. Conf. on Pattern Recognition (ICPR'02), vol. 3, 2002.
[8] M. Hans, B. Graf, R. D. Schraft, "Robotic Home Assistant Care-O-bot: Past-Present-Future," Proc. 2002 IEEE Int. Workshop on Robot and Human Interactive Communication (RO-MAN), pp. 380-385, 2002.
[9] M. Zobel, J. Denzler, B. Heigl, E. Nöth, D. Paulus, J. Schmidt, G. Stemmer, "MOBSY: Integration of Vision and Dialogue in Service Robots," Machine Vision and Applications, vol. 14, pp. 26-34, 2003.