
Ben-Gurion University of the Negev
Faculty of Engineering Sciences
Dept. of Industrial Engineering and Management

An Integrated Project: Multimedia, Machine Vision and Intelligent Automation Systems

Hand Gesture Telerobotic System using Fuzzy Clustering Algorithms


Wachs Juan & Kartoun Uri
Supervisors: Prof. Stern Helman & Prof. Edan Yael
Summer 2001

Outline

1. Introduction
   1.1 Background
   1.2 Objectives
2. Literature Review
   2.1 Telerobotics
   2.2 Gesture Recognition
3. System Design and Architecture
   3.1 A255 Robot
   3.2 Video Capturing Card
   3.3 Web Camera
   3.4 Video Imager
4. Methodology
   4.1 Choosing the Programming Language to Build the Interface
   4.2 Developing and Using the Gesture Recognition Algorithm
   4.3 The Hand Gesture Language
   4.4 Building the Gesture Control Interface
   4.5 The User-Robot Communication Link
      4.5.1 Background on Communication Architecture and Protocols
      4.5.2 Integrating the Camera into the System
5. Testing the System
   5.1 Task Definition
   5.2 Experimental Results
6. Conclusions
   6.1 Summary of Results
   6.2 Future Work
7. Bibliography

1. Introduction

1.1 Background

In today's world the Internet plays an important role in everyone's life. It provides a convenient way of receiving information, communicating electronically, being entertained and conducting business. Teleoperation is the direct and continuous human control of a teleoperator [Sheridan, 1992]. Robotics researchers are now using the Internet as a tool to provide feedback for teleoperation. Internet-based teleoperation will inevitably lead to many useful applications in various sectors of society.

To understand the meaning of teleoperation, the definition of robotics is examined first. Robotics is the science of designing and using robots. A robot is defined as a reprogrammable multi-functional manipulator designed to move materials, parts, tools or specialized devices through variable programmed motions for the performance of a variety of tasks [Robot Institute of America, 1979]. Robots can also react to changes in the environment and take corrective actions to perform their tasks successfully [Burdea and Coiffet, 1994]. By a broad reading of this definition, even simple electromechanical systems, such as toy trains, may be classified as robots because they manipulate themselves in an environment.

One of the difficulties associated with teleoperation is that the human operator is remote from the object being controlled; the feedback data may therefore be insufficient for correct control decisions. Hence, a telerobot is described as a form of teleoperation in which a human operator acts as a supervisor, intermittently communicating to a computer information about goals, constraints, plans, contingencies, assumptions, suggestions and orders relative to a limited task, and getting back information about accomplishments, difficulties, concerns and, as requested, raw sensory data, while the subordinate robot executes the task based on the information received from the human operator plus its own artificial sensing and intelligence [Earnshaw et al., 1994].

A teleoperator can be any machine that extends a person's sensing or manipulating capability to a location remote from that person. In situations where it is impossible to be present at the remote location, a teleoperator can be used instead. Such situations may involve a hostile environment, such as a minefield or the bottom of the sea, or simply a distant location. A teleoperator can thus replace the presence of a human in hazardous environments. To operate a telerobot, new technology must emerge that takes advantage of complex new robotic capabilities while making such systems more user-friendly.

Robots are intelligent machines that provide services for human beings and for other machines. They operate in dynamic and unstructured environments and interact with people who are not necessarily skilled in communicating with robots [Dario et al., 1996]. A friendly and cooperative interface is thus critical for the development of service robots [Ejiri, 1996, Kawamura et al., 1996]. A gesture-based interface holds the promise of making human-robot interaction more natural and efficient. Gesture-based interaction was first proposed by M. W. Krueger as a new form of human-computer interaction in the middle of the seventies [Krueger, 1991], and there has been a growing interest in it recently.
As a special case of human-computer interaction, human-robot interaction using hand gestures has a number of characteristics: the background is complex and dynamic; the lighting condition is variable; the shape of the human hand is deformable; the implementation is required to execute in real time; and the system is expected to be user and device independent [Triesch and Malsburg, 1998]. Humans naturally use gestures to communicate. It has been demonstrated that children can learn to communicate with gestures before they learn to talk. Adults use gestures in many situations: to accompany or substitute for speech, to communicate with pets, and occasionally to express feelings. Gestural interfaces, where the user employs hand gestures to trigger a desired manipulation action, are the
most widely used type of direct manipulation interface, and provide an intuitive mapping of gestures to actions [Pierce et al., 1997, Fels and Hinton, 1995]. For instance, forming a fist while intersecting the hand with a virtual object might execute a grab action. Gesture recognition is human interaction with a computer in which human gestures, usually hand motions, are recognized by the computer.

The potential and mutual benefits that Robotics and Gesture Recognition offer each other are great. Robotics is beneficial to Gesture Recognition in general by providing haptic interfaces and human-factors know-how. Space limitations preclude a discussion of other interesting application areas, such as medical robotics and microrobotics. Since Gesture Recognition is a younger technology than Robotics, it will take some time until its benefits are recognized and until some existing technical limitations are solved. Full implementation in telerobotics, manufacturing and other areas will require more powerful computers than presently exist, faster communication links and better modeling. Once better technology is available, usability, ergonomics and other human-factors studies need to be done in order to gauge the effectiveness of such systems.

1.2 Objectives

The fundamental research objective is to advance the state of the art of telerobotic hand gesture based systems. As for all technologies, but more importantly for a much emphasized and complex technology such as Gesture Recognition, it is important to choose appropriate applications with well-defined functionality objectives. It is also important to compare the abilities of Gesture Recognition with competing technologies for reaching those objectives. This ensures that the solution can be integrated with standard business practice.

The main objective of this project is to design and implement a telerobotic system using hand gesture control. By using hand gesture control, an operator may control a real remote robot performing a task. The operator can use a unique hand gesture language to position the robot as desired. Specific objectives are to:

- Develop and evaluate a hand gesture language and interface.
- Test and validate new strategies and algorithms for image recognition.
- Demonstrate the use of hand gestures in a telerobotic system.

2. Literature Review

2.1 Telerobotics

A telerobot is defined as a robot controlled at a distance by a human operator [Durlach and Mavor, 1994]. Sheridan, 1992, makes a finer distinction, which depends on whether all robot movements are continuously controlled by the operator (manually controlled teleoperator) or whether the robot has partial autonomy (telerobot and supervisory control). By this definition, the human interface to a telerobot is distinct and not part of the telerobot itself. Telerobotic devices are typically developed for situations or environments that are too dangerous, uncomfortable, limiting, repetitive, or costly for humans to perform in. Some applications are listed in
[Sheridan, 1992], such as: Underwater - inspection, maintenance, construction, mining, exploration, search and recovery, science, surveying; Space - assembly, maintenance, exploration, manufacturing, science; Resource industry - forestry, farming, mining, power line maintenance; Process control plants - nuclear, chemical, etc., involving operation, maintenance, emergency response; Military - operations in the air, undersea and on land; Medical - patient transport, disability aids, surgery [Sorid and Moore, 2000], monitoring, remote treatment; Construction - earth moving, building construction, building and structure inspection, cleaning and maintenance; Civil security - protection and security, fire-fighting, police work, bomb disposal.

Telerobots may be remotely controlled manipulators or vehicles. The distinction between robots and telerobots is fuzzy and a matter of degree. Although the hardware is the same or similar, robots require less human involvement for instruction and guidance than telerobots. There is a continuum of human involvement, from direct control of every aspect of motion, to shared or traded control, to nearly complete robot autonomy; yet robots perform poorly when adaptation and intelligence are required. They do not match the human sensory abilities of vision, audition and touch, human motor abilities in manipulation and locomotion, or even the human physical body in terms of a compact and powerful musculature and portable energy source. Hence, in recent years many robotics researchers have turned to telerobotics. Nevertheless, the long-term goal of robotics is to produce highly autonomous systems that overcome difficult problems in design, control and planning. By observing what is required for successful human control of a telerobot, one may infer what is needed for autonomous control. Furthermore, the human represents a complex mechanical and dynamic system that must be considered. More generally, telerobots are representative of man-machine systems that must have sufficient sensory and reactive capability to successfully translate and interact within their environment.

In the future, educators and experimental scientists will be able to work with remote colonies of taskable machines via a remote science paradigm [Cao et al., 1995] that allows: (a) multiple users in different locations to collaboratively share a single physical resource [Tou et al., 1994], and (b) enhanced productivity through reduced travel time, enabling one experimenter to participate in multiple, geographically distributed experiments. The complexity of the kinematics must be hidden from the user while still allowing a flexible operating envelope. Where possible, extraneous complications should be filtered out. Interface design has a significant effect on the way people operate the robot. This is borne out by differences in operator habits between the various operator interfaces, and is consistent with interface design theory, where there are some general principles that should be followed but good interface design is largely an iterative process [Preece et al., 1994]. With the introduction of web computer languages (e.g., Java) there is a temptation to move towards continuously updating robot information (both images and positions). Where possible, the data transmitted should be kept to a minimum; low-bandwidth but relevant information is much more useful than high-bandwidth irrelevant data. Graphical models of the manipulator and scene are known to improve the performance of telerobotic systems.
[Browse and Little, 1991] found a 57% reduction in the error rate of operators predicting whether a collision would occur between the manipulator and a block. Future developments will extend this ability to view moves via a simulated model, and to plan and simulate moves before submission. Telerobotics has also been implemented in a client-server architecture; e.g., one of the first successful World Wide Web (WWW) based robotic projects was the Mercury project [Goldberg et al., 1995]. This later evolved into the Telegarden project [Goldberg et al., 1995], which used a similar system based on a SCARA manipulator to uncover objects buried within a defined workspace. Users were able to control the position of the robot arm and view the scene as a series of periodically updated static images. The University of Western Australia's Telerobot experiment [Taylor and Dalton, 1997, Taylor and Trevelyan, 1995] provides Internet control of an industrial ASEA IRB-6 robot arm through the WWW. Users are required to manipulate and stack wooden blocks and, like the Mercury and Telegarden projects, the view of the work cell is limited to a sequence of static images captured by cameras located around the workspace. On-line access to mobile
robotics and active vision hardware has also been made available in the form of the Netrolab project [McKee, 1995].

Problems with static pictures can be avoided by using video technology, which is becoming more and more popular in the Internet domain. Video is one of the most expressive multimedia applications, and provides a natural way of presenting information which results in a stronger impact than images, static text and figures [Burger, 1993]. Because the transfer of video demands high bandwidth, different methods of video transmission are used. Lately, streaming video [Ioannides and Stemple, 1998] applications such as Real Video, QuickTime Movie and Microsoft Media Player are commonly used on the Internet. These applications provide the ability to transmit video recordings over low-bandwidth connections such as standard telephone lines. The previously mentioned projects rely on cameras to capture the robot position and current environment and distribute them to the user via the WWW. It is clear that such an approach needs a high-speed network to achieve on-line control of the robot arm. Data transmission times across the World Wide Web depend heavily on the transient loading of the network, making direct teleoperation (the use of cameras to obtain robot arm position feedback) unsuitable for time-critical interactions. Rather than allowing the users to interact with the laboratory resources directly, as in many examples, users are required to configure the experiments using a simulated representation (a virtual robot arm and its environment) of the real-world apparatus. This configuration data is then downloaded to the real work-cell for verification and execution on the real device, before the results are returned to the user once the experiment is complete.

2.2 Gesture Recognition

Ever since the "Put-That-There" system, which combined speech and pointing, was demonstrated [Bolt, 1980], there have been efforts to enable human gesture input. Many of them investigated the visual pattern interpretation problem. Most of them, however, dealt with the sign language recognition problem [Starner and Pentland, 1995, Waldron and Kim, 1995, Vogler and Metaxas, 1998]. As one of the few notable exceptions, an overall framework for recognizing natural gestures is discussed by [Wexelblat, 1995]. His method finds low-level features, which are fed into a path analysis and temporal integration step. There was, however, no discussion of a concrete model for the analysis or of the temporal integration method for the features. When the patterns are complex, continuous and multi-stroke, constructing a good analyzer is itself a big problem. Also, the low-level analysis cannot be fully separated from the high-level model or context. On the other hand, gestures have been used to support computer-aided presentations [Baudel, 1993], where the interaction model was defined by a set of simple rules and predetermined guidelines.

Any serious attempt to interpret hand gestures should begin with an understanding of natural human gesticulation. The physiology of hand movement was examined from the point of view of computer modeling by [Lee and Kunii, 1995]. They analyzed the range of motion of individual joints and other characteristics of the physiology of hands. Their goal was to build an accurate internal model whose parameters could be set from imagery. Francis Quek of the University of Michigan has studied natural hand gestures in order to gain insights that may be useful in gesture recognition [Quek, 1993].
For example, he makes the distinction between inherently obvious (transparent) gestures that need little or no pre-agreed interpretations, and iconic (opaque) gestures. He observes that most intentional gesticulation generally is, or soon becomes, iconic. Other classifications are whether a gesture is indicating a spatial concept or not, and whether it is intentional or unconscious. He indicates that humans typically use whole hand or digit motion, but seldom both at once. He suggests that if a hand moves from a location, gesticulates, then returns to the original position, the gesticulation was likely to
be a gesture rather than an incidental movement of the hand. Spontaneous gesticulation that accompanies speech was examined by McNeill [McNeill, 1992]. He discusses how it interacts with speech and what it can tell us about language. A basic premise is that gestures complement speech, in that they capture the holistic and imagistic aspects of language that are conveyed poorly by words alone. Most applicable here, he cites work indicating that gestures have three phases: preparation, where the hand rises from its resting position and forms the shape that will be used; stroke, where the meaning is conveyed; and retraction, where the hand returns to its resting position. Cassell [Cassell et al., 1994] uses these types of observations about spontaneous gesture to drive the gestures of animated characters in order to test specific theories about gesticulation and communication. She has also been a part of gesture recognition projects at the MIT Media Lab [Wilson et al., 1996].

Until a few years ago, nearly all work in this area used mechanical devices to sense the shape of the hand. The most common device was a thin glove mounted with bending sensors of various kinds, such as the DataGlove. Often a magnetic field sensor was used to detect hand position and orientation. This technique is prone to problems in the electro-magnetically noisy environment of most computer labs. Researchers reported successful data glove type systems in areas such as CAD (computer-aided design), sign language recognition and a range of other areas [Sturman and Zeltzer, 1994]. However, in spite of the ready availability of these sensors, interest in glove-based interaction faded. More recently, the increasing computer power available and advances in computer vision have come together with the interest in virtual reality and human-centric computing to spark a strong interest in visual recognition of hand gestures. The interest seems to come from many sectors. The difficulty of the problems involved makes it a good application domain for vision researchers interested in recognition of complex objects [Cui and Weng, 1995, Kervrann and Heitz, 1995]. The freedom from the need for interface devices has attracted the interest of those working on virtual environments [Darrell and Pentland, 1995]. The potential for practical applications is sparking work on various aspects of interface design [Freeman and Weissman, 1995, Crowley et al., 1995].

A stereo vision and optical flow system for modeling and recognizing people's gestures was developed in [Huber and Kortenkamp, 1995]. The system is capable of recognizing up to six distinct pose gestures, such as pointing or hand signals, and then interpreting these gestures within the context of an intelligent agent architecture. Gestures are recognized by modeling the person's head, shoulders, elbows and hands as a set of proximity spaces. Each proximity space is a small region in the scene in which stereo disparity and motion are measured. The proximity spaces are connected with joints/links that constrain their position relative to each other. A robot recognizes these pose gestures by examining the angles between the links that connect the proximity spaces. Confidence in a gesture is built up logarithmically over time as the angles stay within the limits for the gesture. This stereo vision and optical flow system runs completely onboard the robot and has been integrated into a Reactive Action Package (RAP).
The RAP system takes into consideration the robot's task, its current state, and the state of the world. Based on these considerations, the skills that should run to accomplish the task are selected. The Perseus system [Kahn et al., 1996] addresses the task of recognizing objects that people point at. The system uses a variety of techniques called feature maps (such as intensity feature maps, edge feature maps, motion feature maps, etc.). The objective of these maps is to solve this visual problem reliably in non-engineered worlds. Like the RAP reactive execution system, Perseus provides interfaces for symbolic higher-level systems.

3. System Design and Architecture

Figure 1 describes the system architecture.

Figure 1. The System Architecture

Figure 2 describes the system flow diagram for performing remote tasks.

Figure 2. System Flow Chart

The operational behavior of the system is described in eight major steps:

1. A visual gesture recognition language was developed.
2. Robot tasks guided by a sequence of human hand gestures, interpreted through the proposed visual gesture recognition language, were implemented.
3. A training process, in which several hundred hand pose images covering all of the system's hand gestures are inserted into a database, was implemented. Every picture receives an identification number and other features, such as height and width. At this stage, a vector containing 13 parameters is built for each picture. These vectors contain the height/width (aspect ratio) and the gray scale levels of sub-blocks of the image.
4. A Fuzzy C-Means clustering process was implemented. During this process, membership values were computed for each gesture image; the number of membership values equals the number of signs in the language. For each gesture image, a membership vector was built.
5. A performance index (cost function) was built. This function estimates how good the system is, from the point of view of how the clusters are built.
6. The user controlled the robot in real time by hand gestures. In this step a real-time image processing process was used.
7. The user received visual feedback from the remote robotic scene.
8. The user performed a task in which a yellow wooden box, placed on a plastic cup structure, was removed using hand gestures and visual feedback only.
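The run-time flow of figure 2 can be summarized in a short C++ sketch. The helper functions and types below (grabFrame, segmentHand, extractFeatures, fuzzyMembership, sendCommand) are hypothetical placeholders for the modules described in section 4, not the project's actual code:

```cpp
// Minimal sketch of the run-time flow in figure 2 (not the project's actual code).
// The helper functions below are placeholders for the modules described in section 4.
#include <cstdio>
#include <vector>

struct Frame    { /* raw or segmented camera image */ };
struct Features { double v[13]; };                      // aspect ratio + 12 sub-block gray levels

static Frame grabFrame()                      { return Frame{}; }       // frame grabber / web camera
static Frame segmentHand(const Frame& f)      { return f; }             // hand/background segmentation
static Features extractFeatures(const Frame&) { return Features{}; }    // see section 4.2
static std::vector<double> fuzzyMembership(const Features&) {           // one value per gesture class
    return std::vector<double>(12, 0.0);
}
static int bestGesture(const std::vector<double>& m) {                  // highest membership wins
    int best = 0;
    for (size_t k = 1; k < m.size(); ++k) if (m[k] > m[best]) best = (int)k;
    return best;
}
static void sendCommand(int gestureId)        { std::printf("gesture %d\n", gestureId); } // TCP/IP link

int main() {
    for (int frame = 0; frame < 100; ++frame) {          // stands in for the real-time loop
        Frame hand = segmentHand(grabFrame());
        Features f = extractFeatures(hand);
        sendCommand(bestGesture(fuzzyMembership(f)));    // e.g. X+, Stop, Open Grip, ...
    }
}
```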

3.1 A255 Robot

Industrial robots are software-programmable, general-purpose manipulators that typically perform highly repetitive tasks in a predictable manner. The function of an industrial robot is determined by the configuration of the arm, the tool, and the motion control hardware and software. Pick and place, assembly, spot or continuous bead welding, and coating or adhesive application are a few tasks commonly performed by robots. Industrial robots come in a wide variety of sizes and configurations and many are configured to perform specialized functions. Welding and coating application robots are examples of specialty robots; they have controls and program codes that are specific to these operations. In addition, some robot configurations are better suited for assembly than for other general-purpose operations, although they can be programmed to perform non-assembly tasks.

Figure 3. CRS Robotics Model A255 5-Axis Articulated, Human Scaled Robot.

The general-purpose robot used in this project, the A255, is an articulated arm manufactured by CRS Robotics. The A255 is termed a human-scaled robot because the physical dimensions of the robot arm are very similar to those of a human being. Some key features of the A255 are: 5 degrees of freedom, 2 kilogram maximum payload, 1.6 second cycle time, 0.05 mm repeatability and a 560 mm reach. Figure 3 shows the A255 and controller as depicted on the CRS Robotics Web page. The A255 robot model is unmatched in its class of small articulated robots for high performance, reliability and overall value. With five degrees of freedom, the A255 robot performs much like a human arm. In fact, the A255 robot can handle many tasks done by humans. Like the entire family of CRS robots, the A255 is working throughout the world in industrial applications such as product testing, material handling, machine loading and assembly.

The A255 also has many uses in robotics research, educational programs, and the rapidly expanding field of laboratory automation. The A255 robot is programmed using the RAPL-3 programming language. The English-like syntax of RAPL-3 is easy to learn, easy to use and yet powerful enough to handle the most complex of tasks. Programming features include continuous-path, joint-interpolated, point-to-point and relative motions, straight-line motion, plus an online path planner to blend commanded motions in joint or straight-line mode. The A255 robot is designed to work with a wide variety of peripherals including rotary tables, bowl feeders, conveyors, laboratory instruments, host computers and other advanced sensors. For maximum process flexibility, the A255 robot can be interfaced with third-party machine vision systems.

3.2 Video Capturing Card

The M-Vision 1000 (MV-1000) is a monochrome video digitizer board which plugs into the PCI (Peripheral Component Interconnect) bus [Mutech]. The MV-1000 product line includes a base board with analog camera support, a memory expansion module (MV-1200) and a group of specialized acquisition plug-in modules. The latter group includes the Digital Camera Interface module (MV-1100), the RGB Color Module (MV-1300) and the NTSC/PAL-Y/C Color Module (MV-1350). In this project, the MV-1300 acquisition module is used. It plugs onto the MV-1000 PCI bus video digitizer and can connect to two sets of RGB inputs or two monochrome video inputs. Real-time display on the VGA card is possible. The add-on memory module MV-1200 is required to store a full frame of RGB color. The MV-1000 video capture board (figure 4) digitizes standard or non-standard analog camera video into 8 bits per pixel at rates up to 40 million samples per second. The digitized video is stored in on-board VRAM. The 1 Mbyte board memory can be expanded to 4 Mbytes with the optional MV-1200 Memory Expansion Module.

Figure 4. The MV-1000 Video Capture Board

The MuTech M-Vision 1000 frame grabber is a full sized PCI bus circuit board. The PCI bus has a number of distinct architectural advantages that benefit video image capture when working at high frame rates, high spatial resolution, or high color resolution.

The PCI bus can transfer data at rates up to 130 Mbytes per second and can run at 33 MHz without contending with the system processor or other high-speed peripherals, such as SCSI disk controllers. It is also truly platform independent, and is not rigidly associated with a "PC" containing an Intel processor: Digital Equipment Corporation offers an Alpha processor based workstation with a PCI bus, and Apple offers the PCI bus on a Power PC based workstation. On a suitably configured Pentium processor PC, the MuTech M-Vision 1000 is capable of transferring data to PC or VGA memory at up to 55 Mbytes per second.

The MV-1000 Software Development Kit (SDK) enables developers to program the MV-1000 board and create various applications using the C language. A series of MV-1000 SDKs have been designed and implemented to work under each of the MS-DOS, Windows 95/98/NT and OS/2 operating environments. The MV-1000 SDK helps programmers to easily set up the frame grabber to work with various types of cameras, and a rich set of application programming interfaces is provided to accomplish this goal. The SDK provides built-in configurations for standard or widely used cameras, such as RS-170 and CCIR, and a group of more than 30 camera configuration files for many Kodak, Pulnix, Dalsa, Dage, EG&G and DVI cameras. The SDK also includes a group of fine-tuned image utility routines. Application developers can use these routines to save grabbed image frames in various image file formats, such as TIFF, TARGA, JPEG and BMP, and to display image files.

3.3 Web Camera

The 3Com HomeConnect PC digital USB (Universal Serial Bus) camera provides video snapshots, video e-mail and videophone calls over networks. It adjusts automatically from bright to dim light and detaches from its cable for easy mobility. The camera is shown in figure 5:

Figure 5. 3Com HomeConnect PC Digital Camera

3.4 Video Imager

The Panasonic video imager used in the project (figure 6) is no longer commercially available. Any other quality analog camera can replace it.

Figure 6. Video Imager

4. Methodology

The development steps are as follows:

1. Obtaining the robot details: information regarding the hardware and software interfaces in the robot controller that could allow it to exchange data with the external environment.
2. Determining interface techniques and methods of data exchange: once the technique of tapping data from and sending data to the robot controller is determined, it is possible to decide on the methods to be used in exchanging data with the robot controller. Commands are constantly being sent to the robot controller to update the robotic variables (e.g., joint positions, gripper state, etc.), and this information is in turn accessed over the network (a hedged sketch of one possible command/status record is given after this list). The method of accessing the robotic variables is determined; this may mean choosing a programming language that allows the model to send and receive data using the same protocol. Under the Windows platform, common protocols used for such data exchange include TCP/IP (transmission control protocol/internet protocol), DDE (dynamic data exchange), OLE (object linking and embedding) and ActiveX.
3. Choosing the programming language to build the interface: the language chosen must suit the robotic application and should be able to use the data exchange protocol to communicate with the robot controller program as well as with the hand gesture model.
4. Building the hand gesture model and verifying the model behavior: having chosen the programming language that allows exchange of data into and out of the model, the next step is to build the model and verify it against the physical system.
5. Building the hand gesture channel interface: having constructed the hand gesture model, code is inserted into the model to enable handshaking and data transfer to and from the interface program, and corresponding code is inserted into the interface program. It is not necessary at this stage to implement all the data variables, merely a few, in order to prototype the interface.
6. Testing the hand gesture channel interface: the interface is tested until it is error-free.
7. Building the channel-robot interface: commands are sent to the robot controller to enable this exchange of data, and corresponding code is inserted into the program.
8. Testing the channel-robot interface: the interface is tested again until it is error-free.
9. Building interfaces with full capability: once the methodology of data transfer is stable, tapping of all necessary data is implemented.
10. Developing and using algorithms: scheduling, collision avoidance and path planning algorithms; testing and improving those algorithms.
11. Full integration testing.
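For step 2, the exact record layout exchanged with the robot controller is not documented here; the following C++ sketch shows one plausible command/status structure, with field names and types that are purely illustrative assumptions:

```cpp
// Hypothetical layout of the data exchanged with the robot controller;
// field names and types are illustrative assumptions, not the project's actual protocol.
#include <cstdint>
#include <cstdio>

struct RobotCommand {
    int32_t gestureId;      // which gesture was recognized (1..12)
    int32_t axis;           // 0 = X, 1 = Y, 2 = Z, 3 = roll
    int32_t direction;      // +1 or -1, 0 for stop/home
    int32_t gripper;        // 0 = no change, 1 = open, 2 = close
};

struct RobotStatus {
    float   joint[5];       // current joint positions of the A255 (5 axes)
    int32_t gripperOpen;    // 1 if the gripper is open
    int32_t moving;         // 1 while a motion command is being executed
};

int main() {
    RobotCommand cmd{ 3 /* hypothetical id for Z+ */, 2, +1, 0 };
    std::printf("gesture %d: axis %d direction %+d\n",
                (int)cmd.gestureId, (int)cmd.axis, (int)cmd.direction);
}
```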

4.1 Choosing the Programming Language to Build the Interface

In general, the work in this project involves heavy use of numerical processing over vast numbers of image pixels, so any language chosen should support this sort of processing. Much of the work also involves the production of image displays in real time, and any solution chosen must provide a way to accomplish this. While C/C++ requires roughly ten times the number of lines of code of higher-level languages, the behavior of that code is more directly under the control of the programmer, leading to fewer surprises in the field. The ultimate cost of a body of code is the sum of the creation time and the future maintenance time. Since scientific software constantly changes, the best hope is that the language itself does not force maintenance activities in addition to the evolutionary activities. Finally, in light of the need to be more productive in non-numerical aspects, one would hope for rich and well-supported data structures in the language. A programming language such as Fortran offers little beyond numeric arrays. C/C++, on the other hand, can produce essentially anything, as long as the programmer is prepared to pay the cost of custom development. This situation is improving now in light of the Standard C++ Template Library, but C/C++ still costs heavily during the development and debugging phases. As a bonus, if the language promotes the development of generic solutions then these can later be reused at little cost, thereby allowing their development costs to be amortized over future programs. A well designed C++ routine ought to be able to pay for itself in this way, but the language does not strongly promote generic design; thus, it is harder than one might think to develop fully reusable code in C++.

In this project, all the code was written in C or C++ with the help of the Intel Image Processing Library, which allows image processing in real time. The Intel Image Processing Library focuses on taking advantage of the parallelism of the SIMD (single-instruction, multiple-data) instructions of the latest generations of Intel processors. These instructions greatly improve the performance of computation-intensive image processing functions, and most functions in the library are specially optimized for the latest generations of processors. The Image Processing Library runs on personal computers that are based on Intel architecture processors and running the Microsoft Windows 95, 98, 2000 or Windows NT operating systems. The library integrates into the customer's application or library written in C or C++.

4.2 Developing and Using the Gesture Recognition Algorithm

In the proposed methodology we suggest the use of fuzzy C-means [Bezdek, 1974] for clustering. Fuzzy C-means (FCM) clustering is an unsupervised clustering technique which is often used for the unsupervised segmentation of multivariate images. The segmentation of the image into meaningful regions with FCM is based on spectral information only; the geometrical relationship between neighboring pixels is not used [Noordam et al., 2000]. The use of Fuzzy C-Means clustering to segment a multivariate image into meaningful regions has been reported in the literature [Park et al., 1998, Stitt et al., 2001, Noordam et al., 2000]. When FCM is applied as a segmentation technique in image processing, the relationship between pixels in the spatial domain is completely ignored, and the partitioning of the measurement space depends on the spectral information only. Although multivariate imaging offers possibilities to differentiate between objects of similar spectra but different spatial correlations, plain FCM cannot utilize this property. Adding spatial information during the spectral clustering has advantages over a spectral segmentation procedure followed by a spatial filter, as the spatial filter cannot always correct segmentation errors. Furthermore, when two overlapping clusters in the spectral domain correspond to two different objects in the spatial domain, the use of a priori spatial information can improve the separation of these two overlapping clusters.

The fuzzy C-means clustering algorithm is described mathematically as follows. Given a set of $n$ data patterns $X = \{x_1, \dots, x_n\}$, where $x_i$ is the $i$-th $p$-dimensional data vector, the FCM algorithm minimizes the weighted within-group sum of squared errors [Bezdek, 1981]:

$$J_m(U, V) = \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ik}^{m} \, d^2(x_i, v_k)$$

where $v_k$ is the prototype of the center of cluster $k$, $u_{ik}$ is the degree of membership of object $x_i$ in the $k$-th cluster, $m > 1$ is a weighting exponent on each fuzzy membership, $d(x_i, v_k)$ is a distance measure between object $x_i$ and cluster center $v_k$, $n$ is the number of objects and $c$ is the number of clusters. A solution of the objective function, i.e. the memberships $u_{ik}$ and the cluster centers $v_k$, can be obtained via an iterative process in which the degrees of membership and the cluster centers are updated as

$$u_{ik} = \left[ \sum_{j=1}^{c} \left( \frac{d^2(x_i, v_k)}{d^2(x_i, v_j)} \right)^{\frac{1}{m-1}} \right]^{-1}, \qquad v_k = \frac{\sum_{i=1}^{n} u_{ik}^{m} x_i}{\sum_{i=1}^{n} u_{ik}^{m}},$$

with the constraints

$$u_{ik} \in [0, 1], \qquad \sum_{k=1}^{c} u_{ik} = 1 \;\; \forall i, \qquad 0 < \sum_{i=1}^{n} u_{ik} < n \;\; \forall k.$$

In the proposed methodology, a classifier is generated by initially using the fuzzy C-means algorithm to partition the data. Once the clusters have been identified through training, they are labeled. Labeling means assigning a linguistic description to each cluster; in our case, the linguistic description is the name of the gesture.
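As a concrete illustration of the iteration above, a minimal C++ FCM sketch is given below. The data layout, the choice m = 2, the crude initialization and the stopping rule are illustrative assumptions, not the project's implementation:

```cpp
// Sketch of the fuzzy C-means iteration described above (Bezdek, 1981).
// Data layout, m = 2, initialization and stopping rule are illustrative assumptions.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

using Vec = std::vector<double>;

static double dist2(const Vec& a, const Vec& b) {        // squared Euclidean distance
    double s = 0.0;
    for (size_t j = 0; j < a.size(); ++j) s += (a[j] - b[j]) * (a[j] - b[j]);
    return s;
}

// Returns the membership matrix u[i][k] for n objects and c clusters.
static std::vector<Vec> fcm(const std::vector<Vec>& x, int c, double m = 2.0,
                            int iters = 100, double eps = 1e-6) {
    const size_t n = x.size(), p = x[0].size();
    std::vector<Vec> u(n, Vec(c, 1.0 / c)), v(c, Vec(p, 0.0));
    for (size_t k = 0; k < (size_t)c; ++k) v[k] = x[k % n];      // crude initialization
    for (int it = 0; it < iters; ++it) {
        double change = 0.0;
        for (size_t i = 0; i < n; ++i)                           // update memberships u_ik
            for (int k = 0; k < c; ++k) {
                double dk = dist2(x[i], v[k]) + 1e-12, sum = 0.0;
                for (int j = 0; j < c; ++j)
                    sum += std::pow(dk / (dist2(x[i], v[j]) + 1e-12), 1.0 / (m - 1.0));
                double unew = 1.0 / sum;
                change = std::max(change, std::fabs(unew - u[i][k]));
                u[i][k] = unew;
            }
        for (int k = 0; k < c; ++k) {                            // update centers v_k
            Vec num(p, 0.0); double den = 0.0;
            for (size_t i = 0; i < n; ++i) {
                double w = std::pow(u[i][k], m);
                den += w;
                for (size_t j = 0; j < p; ++j) num[j] += w * x[i][j];
            }
            for (size_t j = 0; j < p; ++j) v[k][j] = num[j] / den;
        }
        if (change < eps) break;                                 // converged
    }
    return u;
}

int main() {
    // Toy data: two obvious groups in 2-D, clustered into c = 2 fuzzy clusters.
    std::vector<Vec> x = { {0,0}, {0,1}, {1,0}, {9,9}, {9,8}, {8,9} };
    std::vector<Vec> u = fcm(x, 2);
    for (size_t i = 0; i < x.size(); ++i)
        std::printf("object %zu: u = (%.2f, %.2f)\n", i, u[i][0], u[i][1]);
}
```

On this toy data each object ends up with a membership close to 1 for its own group and close to 0 for the other, which is the behavior the gesture classifier relies on.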

During the training process, several hundred hand gesture images are inserted into a database that covers all of the system's hand gestures; at least 25 frames of each of the 12 hand gestures in the language are taken. Every picture gets an identification number plus a feature vector, and the feature vector is also inserted into the database. At this stage, a feature vector containing 13 parameters is built for each picture. These vectors contain the height/width (aspect ratio) and the sub-block gray scale values. The sub-block gray scale values are obtained by dividing the image into 3 rows and 4 columns and taking the norm of each cell of the image. The selection of 12 gray scale values for each frame was found empirically to provide enough discriminatory power for individual gesture classification; additional experimentation on the optimal block size is left for the future.

During the Fuzzy C-Means clustering process, membership values are determined for each gesture type; the number of membership values is equal to the number of gestures in the language. For each picture, a membership vector is built. These membership values are inserted into a database in matrix form: the matrix columns are the gesture signs (1 to 12) and the rows are the pictures taken (1 to several hundred). Each cell in the matrix gets a numeric value (0 to 1000) that represents how close the picture is to a gesture, where 1000 means a high membership relation and 0 means no relation at all. At this stage, a performance index (cost function) is built. This function estimates how good the system is, from the point of view of how the clusters are built. Figures 7 and 8 describe the training process.

Figure 7. A successful Recognition

Figure 8. Inserting Two Training Gestures
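To make the 13-parameter feature vector described above concrete, the following C++ sketch computes the aspect ratio plus the 12 sub-block gray levels of a 3 x 4 partition of a segmented hand image. The Image type and the use of the mean gray level as the block "norm" are assumptions for illustration; the project's actual data structures may differ:

```cpp
// Sketch of the 13-parameter feature vector: aspect ratio + 12 sub-block gray levels
// (3 rows x 4 columns). The Image type and the mean-based block "norm" are assumptions.
#include <array>
#include <cstdio>
#include <vector>

struct Image {
    int width = 0, height = 0;
    std::vector<unsigned char> pixels;              // row-major 8-bit gray levels
    unsigned char at(int x, int y) const { return pixels[y * width + x]; }
};

std::array<double, 13> extractFeatures(const Image& img) {
    std::array<double, 13> f{};
    f[0] = static_cast<double>(img.height) / img.width;        // aspect ratio (height / width)
    const int rows = 3, cols = 4;
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c) {
            long sum = 0, count = 0;
            for (int y = r * img.height / rows; y < (r + 1) * img.height / rows; ++y)
                for (int x = c * img.width / cols; x < (c + 1) * img.width / cols; ++x) {
                    sum += img.at(x, y);
                    ++count;
                }
            f[1 + r * cols + c] = count ? double(sum) / count : 0.0;   // block gray level
        }
    return f;
}

int main() {
    Image img;
    img.width = 64; img.height = 48;
    img.pixels.assign(img.width * img.height, 128);             // dummy uniform gray frame
    std::array<double, 13> f = extractFeatures(img);
    std::printf("aspect ratio = %.2f, first block = %.1f\n", f[0], f[1]);
}
```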

4.3 The Hand Gesture Language

Hand gestures are a form of communication among people [Huang and Pavlovic, 1995]. The use of hand gestures in the field of human-computer interaction has attracted new interest in the past several years.

Computers and computerized input devices, such as keyboards, mice and joysticks, may come to seem like stone-age devices. Only in the last several years has there been an increased interest in trying to introduce other means of human-computer interaction. Every gesture is a physical expression of a mental concept [Thieffry, 1981]. A gesture is motivated by the intention to perform a certain task: indication, rejection, grasping, drawing a flower, or something as simple as scratching one's head [Huang and Pavlovic, 1995]. A robot task guided by a sequence of human hand gestures through the proposed visual gesture recognition language can be seen in figure 9.

Figure 9. The A255 arm has five axes of motion (joints): 1 (waist), 2 (shoulder), 3 (elbow), 4 (wrist pitch) and 5 (tool roll)

A visual gesture recognition language was developed for this robot (figure 10).

Figure 10. Visual Gesture Recognition Language

The X+ and X- hand gestures control the X-axis, the Y+ and Y- hand gestures control the Y-axis, and the Z+ and Z- hand gestures control the Z-axis of the A255 robot arm. The Roll Right and Roll Left hand gestures control joint 5, and the Open Grip and Close Grip gestures control the robot gripper. The Stop hand gesture halts any action the robot is performing. The Home hand gesture resets and calibrates the robot joints automatically.

4.4 Building the Gesture Control Interface

A gesture control interface was built (figure 11) for controlling a remote robot. In the upper left corner there is a real-time hand gesture picture. In the upper right corner there is the segmented hand gesture. In the lower left corner there is the selection of the 12 gray scale values. Below the segmented hand gesture picture there are thumbnails of the 12 different hand gestures and a bar graph that shows whether a specific gesture was recognized; the levels of these bars are proportional to the membership level of the current gesture in each of the gesture classes. Below that there is real-time video feedback from the remote robot site.

Figure 11. Gesture Control Interface
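To illustrate how the recognized gesture classes of figure 10 might be turned into the robot motions listed in section 4.3 (under the incremental control mode described below), here is a hedged C++ sketch. The enum, the fixed step sizes and the command functions are assumptions standing in for the actual RAPL-3 / controller interface:

```cpp
// Hypothetical mapping from the 12 gesture classes of figure 10 to A255 motions.
// The enum, the step sizes and the command interface are illustrative assumptions.
#include <cstdio>

enum class Gesture { XPlus, XMinus, YPlus, YMinus, ZPlus, ZMinus,
                     RollRight, RollLeft, OpenGrip, CloseGrip, Stop, Home };

// Placeholders for the real commands sent to the A255 controller.
static void moveIncrement(char axis, double mm) { std::printf("move %c by %+.1f mm\n", axis, mm); }
static void rollIncrement(double deg)           { std::printf("roll by %+.1f deg\n", deg); }
static void setGripper(bool open)               { std::printf("gripper %s\n", open ? "open" : "close"); }
static void stopMotion()                        { std::printf("stop\n"); }
static void homeRobot()                         { std::printf("home\n"); }

void executeGesture(Gesture g) {
    const double step = 10.0;                   // assumed incremental step per recognized gesture
    switch (g) {
        case Gesture::XPlus:     moveIncrement('X', +step); break;
        case Gesture::XMinus:    moveIncrement('X', -step); break;
        case Gesture::YPlus:     moveIncrement('Y', +step); break;
        case Gesture::YMinus:    moveIncrement('Y', -step); break;
        case Gesture::ZPlus:     moveIncrement('Z', +step); break;
        case Gesture::ZMinus:    moveIncrement('Z', -step); break;
        case Gesture::RollRight: rollIncrement(+5.0);       break;   // joint 5 (tool roll)
        case Gesture::RollLeft:  rollIncrement(-5.0);       break;
        case Gesture::OpenGrip:  setGripper(true);          break;
        case Gesture::CloseGrip: setGripper(false);         break;
        case Gesture::Stop:      stopMotion();              break;
        case Gesture::Home:      homeRobot();               break;
    }
}

int main() {
    executeGesture(Gesture::ZMinus);            // e.g. lower the arm by one increment
    executeGesture(Gesture::CloseGrip);
}
```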

To control the robot, two control modes are available:

a) Continuous movement commands: if a gesture command is given (e.g., up), the robot moves continuously until a stop command is given. Every gesture command must be followed by a stop command.

b) Incremental movement commands: a gesture command received by the robot at time t is carried out over a fixed interval Δt; at time t + Δt, the robot is ready for a new command. For example, if the left gesture is given, the robot moves during Δt; if the same left gesture is held, it moves again for Δt; if the gesture is then changed to right, it moves right for Δt. This way one can easily position the robot over an object by jogging it left-right-left, etc., until the correct position is observed, and it is not necessary to go through the stop gesture every time a gesture is changed. In this project, this mode was chosen for controlling the remote robot.

4.5 The User-Robot Communication Link

4.5.1 Background on Communication Architecture and Protocols

The Internet evolved from ARPANET, the U.S. Department of Defense's network created in the late 1960s. ARPANET was designed as a network of computers that communicated via a standard protocol, a set of rules that govern communications between computers. While the original host-to-host protocol limited the potential size of the original network, the development of TCP/IP (Transmission Control Protocol/Internet Protocol) enabled the interconnection of a virtually unlimited number of computers. Every host on the Internet has a unique Internet Protocol (IP) address and a unique Internet hostname. For the robot server, the Internet address is 132.72.135.123, for the hand gesture client computer it is 132.72.135.21, and for the web camera server it is 132.72.135.124 (figure 12):

Figure 12. User-Robot Communication Architecture

TCP is responsible for making sure that the commands get through to the other end. It keeps track of what is sent, and re-transmits anything that did not get through. If a message is too large for one datagram, e.g. if the stream of hand gesture commands is too large, TCP splits it up into several datagrams and makes sure that they all arrive correctly. Since these functions are needed by many applications, they are put together into a separate protocol. TCP can be thought of as a library of routines that applications can use when they need reliable network communications with another computer. Similarly, TCP calls on the services of IP. Although the services that TCP supplies are needed by many applications, there are still some kinds of applications that do not need them; however, there are some services that every application needs, and these services are put together into IP. As with TCP, IP can be thought of as a library of routines that TCP calls on, but which is also available to applications that do not use TCP. This strategy of building several levels of protocol is called "layering". Application programs such as mail, TCP and IP are considered to be separate "layers", each of which calls on the services of the layer below it. Generally, TCP/IP applications use four layers:

a) an application protocol such as mail;
b) a protocol such as TCP that provides services needed by many applications;
c) IP, which provides the basic service of getting datagrams to their destination;
d) the protocols needed to manage a specific physical medium, such as Ethernet or a point-to-point line.

TCP/IP is based on the "catenet model" (described in more detail in IEN 48). This model assumes that there are a large number of independent networks connected together by gateways. The user should be able to access computers or other resources on any of these networks. Datagrams will often pass through a dozen different networks before getting to their final destination, and the routing needed to accomplish this should be completely invisible to the user. As far as the user is concerned, all that is needed in order to access another system is an "Internet address". This is an address that looks like 132.72.135.124. It is actually a 32-bit number, but it is normally written as four decimal numbers, each representing 8 bits of the address. Generally the structure of the address gives some information about how to get to the system. For example,
132.72 is a network number assigned to BGU (Ben-Gurion University of the Negev). BGU uses the next number to indicate which of the campus Ethernets is involved; 132.72.135 happens to be an Ethernet used by the Department of Industrial Engineering and Management. The last number allows for up to 254 systems on each Ethernet. Gesture commands are sent to the remote robot over the TCP/IP protocol in groups of 5, to assure that the recognized gesture sent from the operator's site is the right one.

4.5.2 Integrating the Camera into the System

For grabbing pictures from the USB (Universal Serial Bus) web camera, FTP (File Transfer Protocol) was used. FTP is an Internet communication protocol that allows uploading and downloading files between a local machine and a machine connected to it via the Internet. FTP is composed of two parts: an FTP client and an FTP server. The FTP client (the web camera client in figure 12) is the software executed on a local machine to send or receive files. The FTP server is the software that executes on the server machine on which the files are to be saved or retrieved (the hand gesture server in figure 12). To be able to send files to the FTP server and web server, three pieces of information should be provided:

- The name of the FTP server: each FTP server on the Internet executes on a separate machine. Machines on the Internet have both DNS (Domain Name Service) names (e.g. bgu.ac.il, jobjuba.com, ...) and TCP/IP addresses (e.g. 132.72.135.3, 209.30.2.100, ...). The name or TCP/IP address of the machine hosting the FTP service must be provided.
- The user identification and login information for the FTP server: FTP servers are protected in such a way that only authorized users have the ability to save and retrieve files. To gain access to an FTP server, the administrator of the specific server should provide a user identification and a password.
- The directory in which to save files: when connected to an FTP server, uploading of files is allowed only to particular directories.
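Returning to the command link of section 4.5.1, the transmission of each recognized gesture in a group of five can be sketched as follows. The use of POSIX sockets and the "all five copies must agree" acceptance rule are assumptions for illustration; the project ran under Windows, and its exact validation rule is not documented here:

```cpp
// Sketch of sending each recognized gesture five times over TCP (section 4.5.1).
// POSIX sockets and the "all five copies must agree" rule are assumptions; the project
// itself ran under Windows and its exact validation rule is not documented here.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

// Client side: connect to the robot server and send one gesture code five times.
bool sendGestureGroup(const char* serverIp, unsigned short port, unsigned char gestureId) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return false;
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, serverIp, &addr.sin_addr);
    if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0) { close(fd); return false; }
    unsigned char group[5];
    std::memset(group, gestureId, sizeof(group));          // five identical copies
    bool ok = send(fd, group, sizeof(group), 0) == (ssize_t)sizeof(group);
    close(fd);
    return ok;
}

// Server side: accept a group of five codes only if all copies agree.
bool validateGroup(const unsigned char* group, int n = 5) {
    for (int i = 1; i < n; ++i)
        if (group[i] != group[0]) return false;
    return true;
}

int main() {
    unsigned char good[5] = {3, 3, 3, 3, 3};
    unsigned char bad[5]  = {3, 3, 7, 3, 3};
    std::printf("good group accepted: %d\n", validateGroup(good));
    std::printf("bad group accepted: %d\n", validateGroup(bad));
    // sendGestureGroup("132.72.135.123", 5000, 3);  // example call to the robot server (port assumed)
}
```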

5. Testing the System

5.1 Task Definition

We are given a robot (A255) under the control of an individual, and a plastic cup structure located on a flat platform, as can be seen in figures 13 and 14:

Figure 13. A255 Robot and a Plastic Cup Structure

Figure 14. Close Look at the Plastic Cup Structure

The main task is to push a yellow wooden box into a plastic cup structure using hand gestures and visual feedback only (figures 15 and 16):

Figure 15. A255 Robot, Plastic Cup Structure, and Yellow Box

Figure 16. Close Look at the Plastic Cup Structure and the Wooden Box

5.2 Experimental Results

For testing and evaluating the system, an experiment was defined. An operator had to control the A255 robot and perform a remote task in real time using hand gestures. The task was to push a yellow wooden cube, located on top of a pillar, into a container adjacent to it. The robot was controlled using hand gestures and visual feedback. The operator performed a set of ten identical experiments, and the time to perform each experiment was recorded. The resulting learning curve (figure 17) indicates rapid learning by the operator.

Figure 17. Learning Curve of the Hand Gesture System*


* Note that standard times were reached after four to six trials.

Figure 18. The Overall View of the Experimental Setting

Figure 19. A Typical Control Sequence to Carry out the Task

6. Conclusions

6.1 Summary of Results

This project has described the design, implementation and testing of a telerobotic gesture-based user interface system using visual recognition. Two aspects of the problem have been examined: the technical aspects of visual recognition of hand gestures in a lab environment, and the issues concerning the usability of such an interface implemented on a remote A255 robot. Experimental results showed that the system satisfies the requirements for a robust and user-friendly input device. The design of the system has incorporated several advances in visual recognition, including the FCM
(fuzzy C-means) algorithm. While segmentation is not perfect, it is fast, of generally good quality, and sufficiently reliable to support an interactive user interface. Being appearance-based, the system assumes that the appearance of the hand in the image is relatively constant. Given the stable environment of the system, this assumption is generally valid. Of course some variation does occur, especially between people and even between poses formed by one person. This has been taken into account both by using a representative set of training images, and by varying those images during training. Variations in lighting are handled by pre-processing the image to reduce their effect. A final contributor to the success of the classifier is a step in which images that have been misclassified after initial training are added to the training set; this helps to fine-tune performance on difficult cases. The path of the hand is smoothed with a novel algorithm which has been designed to compensate for the types of noise present in the domain, but also to leave a bounding-box movement which is easy for the system to examine for motion features, and which appears natural to the user. Just as with natural gesticulation, motion and pose both play a role in the meaning of a gesture. Symbolic features of the motion path are extracted and combined with the classification of the hand's pose at key points to recognize various types of gestures. By using the TCP/IP protocol and transferring information as a sequence of "datagrams", the communication reliability between the system computers was assured. The result is a working system which allows a user to control a remote robot using hand gestures.

One can analyze how gestures are best used to interact with objects on a screen. Problems with integrating gesture into current interface technology have been pointed out, such as the design of menus and the proliferation of small control icons. This work has also pointed out inherent characteristics of gesture that must be considered in the design of any gesture-based interface. While this project has shown that gesture can be made to work as a reliable interface modality in the context of today's graphical user interfaces, it leaves a significant high-level question unanswered: can gesture provide sufficient benefits to the user to justify replacing current interface devices? Many people see speech recognition as a major player in the interface of future workstations because of the freedom it provides the user, but speech by itself cannot make a complete interface. There are some operations that it simply cannot express well, or which are much more concise with some type of spatial command. Mice, however, will become increasingly impractical as we move away from traditional screen-on-desk environments. Gesture provides many of the same benefits for spatial interaction tasks as voice does for textual tasks. Together, speech and gesture have the potential to dramatically change how we interact with our machines. In the near term they can remove many of the restrictions imposed by current interface devices. In the longer term they offer the potential for machines to operate by observing us rather than by always being told what to do. It is our sincere hope that this work has contributed to realizing that potential.
We would like to note that all of the system files are presently located in the Multimedia Laboratory of the Department of Industrial Engineering and Management at Ben-Gurion University of the Negev; a copy of these files will be provided if desired. A demonstration of the system will also be given upon
request.

6.2 Future Work

The system as implemented is capable of recognizing hand gestures and controlling a remote robot in a realistic setting in real time. It is very likely that with some engineering it could become a reliable add-on to current window systems, and it would be easy to extend the system's capabilities so that gestures could take on more of the interface duties. Solutions to the remaining problems of reliability and accuracy appear to be relatively straightforward. This work examines what contributed to the success of the system at this level, and what holds it back from better performance. It also demonstrates the potential of the hand gesture as an input device. After an initial learning curve, an experienced user can manipulate objects at a remote site with speed and comfort comparable to other popular devices. The ability to interact directly with the on-screen robot and objects seems to be more comfortable to some users than the indirect pointing used with a mouse or joystick. Several directions for future work are:

1. A comparative evaluation using the following control methods:

- Hand pendant control.
- Mouse and keyboard control.
- Voice control, if equipment becomes available.
- Joystick control, if equipment becomes available.

2. Use of the hand gesture system to control a mobile robot.
3. Use of the hand gesture system to control a virtual robot (fixed or mobile) in a 3D virtual reality model.
4. More extensive analysis of the learning curve.
5. Optimization of the feature vector.
6. An experiment comparing the incremental and continuous control methods.

7. Bibliography

Baudel T. and Beaudouin-Lafon M. 1993. CHARADE: Remote Control of Objects using Free-Hand Gestures. Communications of the ACM, vol. 36, no. 7, pp. 28-35.

Bezdek J. C. 1974. Cluster Validity with Fuzzy Sets. Journal of Cybernetics, 3: 58-73.

Bezdek J. C. 1981. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York.

Bolt R. A. 1980. Put-That-There: Voice and Gesture at the Graphics Interface. Computer Graphics, 14(3), pp. 262-270.

Browse R. A. and Little S. A. 1991. The Effectiveness of Real-Time Graphic Simulation in Telerobotics. IEEE Conference on Decision Aiding for Complex Systems, vol. 2, pp. 895-898.

Burger J. 1993. The Desktop Multimedia Bible. Addison-Wesley Publishing Company.

Cao Y. U., Chen T. W., Harris M. D., Kahng A. B., Lewis M. A. and Stechert A. D. 1995. A Remote Robotics Laboratory on the Internet. Commotion Laboratory, UCLA CS Dept., Los Angeles.

Cassell J., Steedman M., Badler N., Pelachaud C., Stone M., Douville B., Prevost S. and Achorn B. 1994. Modeling the Interaction Between Speech and Gesture. Proceedings of the 16th Annual Conference of the Cognitive Science Society, Georgia Institute of Technology, Atlanta, USA.

Crowley J., Berard F. and Coutaz J. 1995. Finger Tracking as an Input Device for Augmented Reality. Proceedings of the International Workshop on Automatic Face and Gesture Recognition, Zurich.

Cui Y. and Weng J. 1995. 2D Object Segmentation from Fovea Images Based on Eigen-Subspace Learning. Proceedings of the IEEE International Symposium on Computer Vision.

Dario P., Guglielmelli E., Genovese V. and Toro M. 1996. Robot Assistants: Applications and Evolution. Robotics and Autonomous Systems, vol. 18, pp. 225-234.

Darrell T. and Pentland A. 1995. Attention-Driven Expression and Gesture Analysis in an Interactive Environment. Proceedings of the International Workshop on Automatic Face and Gesture Recognition, Zurich.

Durlach I. N. and Mavor S. N. 1994. Virtual Reality: Scientific and Technological Challenges, pp. 304-361.

Ejiri M. 1996. Towards Meaningful Robotics for the Future: Are We Headed in the Right Direction? Robotics and Autonomous Systems, vol. 18, pp. 1-5.

Freeman W. and Weissman C. 1995. Television Control by Hand Gestures. Proceedings of the International Workshop on Automatic Face and Gesture Recognition, Zurich.

Fels S. and Hinton G. 1995. Glove-TalkII: An Adaptive Gesture-to-Formant Interface. Proceedings of the ACM CHI '95 Conference on Human Factors in Computing Systems, pp. 456-463.

Franklin C. and Cho Y. 1995. Virtual Reality Simulation Using Hand Gesture Recognition. University of California at Berkeley, Department of Computer Science.

Goldberg K., Maschna M. and Gentner S. 1995. Desktop Teleoperation Via The WWW. Proceedings of the IEEE International Conference on Robotics and Automation, pp. 654-659, Japan.

Goldberg K., Santarromana J., Bekey G., Gentner S., Morris R., Wiegley J. and Berger E. 1995. The Telegarden. Proceedings of ACM SIGGRAPH.

Huang T. S. and Pavlovic V. I. 1995. Hand Gesture Modeling, Analysis, and Synthesis. University of Illinois, Beckman Institute.

Web resources: http://www.3com.com/, http://www.crsrobotics.com/, http://www.mutech.com/, http://www.panasonic.com, http://telegarden.aec.at/, http://telerobot.mech.uwa.edu.au/

Ioannides A. and Stemple C. 1998. The Virtual Art Gallery / Streaming Video on the Web. Proceedings of the Conference on SIGGRAPH 98: Conference Abstracts and Applications, p. 154.

Kahn R. E., Swain M. J., Prokopowicz P. N. and Firby J. 1996. Gesture Recognition Using the Perseus Architecture. Department of Computer Science, University of Chicago.

Kawamura K., Pack R., Bishay M. and Iskarous M. 1996. Design Philosophy for Service Robots. Robotics and Autonomous Systems, vol. 18, pp. 109-116.

Kervrann C. and Heitz F. 1995. Learning Structure and Deformation Modes of Non-Rigid Objects in Long Image Sequences. Proceedings of the International Workshop on Automatic Face and Gesture Recognition, Zurich.

Kortenkamp D., Huber E. and Bonasso P. 1995. Recognizing and Interpreting Gestures on a Mobile Robot. Robotics and Automation Group, NASA, Houston.

Krueger M. W. 1991. Artificial Reality II. Addison-Wesley.

Lee J. and Kunii T. 1995. Model-Based Analysis of Hand Posture. IEEE Computer Graphics and Applications (SIGGRAPH '95), vol. 15, no. 5, pp. 77-86.

McKee G. 1995. A Virtual Robotics Laboratory for Research. SPIE Proceedings, pp. 162-171.

McNeill D. 1992. Hand and Mind. University of Chicago Press.

Noordam J. C., van den Broek W. H. A. M. and Buydens L. M. C. 2000. Geometrically Guided Fuzzy C-means Clustering for Multivariate Image Segmentation. Proceedings of the International Conference on Pattern Recognition.

Park S. H., Yun I. D. and Lee S. U. 1998. Color Image Segmentation Based on 3-D Clustering: Morphological Approach. Pattern Recognition, 21(8): 1061-1076.

Pierce J., Forsberg A., Conway M., Hong S., Zeleznik R. and Mine M. 1997. Image Plane Interaction Techniques in 3D Immersive Environments. Proceedings of the 1997 Symposium on Interactive 3D Graphics.

Preece J., Rogers Y., Sharp H., Benyon D., Holland S. and Carey T. 1994. Human-Computer Interaction. Addison-Wesley Publishing Company.

Quek F. 1993. Hand Gesture Interface for Human-Machine Interaction. Proceedings of Virtual Reality Systems.

Robot Institute of America, 1979.

Sheridan T. B. 1992. Telerobotics, Automation, and Human Supervisory Control. Cambridge: MIT Press.

Sheridan T. B. 1992. Defining Our Terms. Presence: Teleoperators and Virtual Environments, 1: 272-274.

Sorid D. and Moore S. K. 2000. The Virtual Surgeon. IEEE Spectrum, July, pp. 26-39.

Starner T. and Pentland A. 1995. Visual Recognition of American Sign Language Using Hidden Markov Models. International Workshop on Automatic Face and Gesture Recognition, pp. 189-194, Zurich.

Stitt J. P., Tutwiler R. L. and Lewis A. S. 2001. Synthetic Aperture Sonar Image Segmentation Using the Fuzzy C-Means Clustering Algorithm. Autonomous Control and Intelligent Systems Division, The Pennsylvania State University Applied Research Laboratory, U.S.A.

Sturman D. and Zeltzer D. 1994. A Survey of Glove-Based Input. IEEE Computer Graphics and Applications, vol. 14, no. 1, pp. 30-39.

Taylor K. and Dalton B. 1997. Issues in Internet Telerobotics. International Conference on Field and Service Robotics, Australia.

Taylor K. and Trevelyan J. 1995. Australia's Telerobot on the Web. 26th International Symposium on Industrial Robots, Singapore.

Thieffry S. 1981. Hand Gestures. In The Hand (R. Tubiana, ed.), pp. 482-492, Philadelphia, PA: Saunders.

Tou I., Berson G., Estrin G., Eterovic Y. and Wu E. 1994. Strong Sharing and Prototyping Group Applications. IEEE Computer, 27(5): 48-56.

Waldron M. B. and Kim S. 1995. Isolated ASL Sign Recognition System for Deaf Persons. IEEE Transactions on Rehabilitation Engineering, pp. 261-271.

Vogler C. and Metaxas D. 1998. ASL Recognition Based on a Coupling Between HMMs and 3D Motion Analysis. Proceedings of ICCV 98, pp. 363-369, India.

Wexelblat A. 1995. An Approach to Natural Gesture in Virtual Environments. ACM Transactions on Computer-Human Interaction, vol. 2, no. 3, pp. 179-200.

Wilson A., Bobick A. and Cassell J. 1996. Recovering the Temporal Structure of Natural Gesture. Proceedings of the Second International Conference on Automatic Face and Gesture Recognition, Killington, Vermont.

Triesch J. and Malsburg C. V. D. 1998. A Gesture Interface for Human-Robot Interaction. Proceedings of the 3rd IEEE International Conference on Automatic Face and Gesture Recognition, pp. 546-551.

Developer: Kartoun Uri
