
International Journal of Computer Applications (0975 – 8887)

National Conference on Role of Engineers in Nation Building (NCRENB-15)

Real Time Skeleton Tracking based Human Recognition System using Kinect and Arduino

Satish Prabhu, B.E EXTC, VIVA Institute of Technology
Jay Kumar Bhuchhada, B.E EXTC, VIVA Institute of Technology
Amankumar Dabhi, B.E EXTC, VIVA Institute of Technology
Pratik Shetty, B.E EXTC, VIVA Institute of Technology

ABSTRACT
A Microsoft Kinect sensor offers high-resolution RGB and depth sensing and is becoming available for widespread use. It supports object tracking, object detection and recognition, as well as human activity analysis, hand gesture analysis and 3D mapping. Facial expression detection is widely used in human-computer interfaces, and the Kinect depth camera can be used to detect common facial expressions. The face is tracked using the MS Kinect 2.0 SDK, which uses the depth map to create a 3D model of the face. By recognizing facial expressions from facial images, a number of human-computer interaction applications can be built. This paper describes the working of the Kinect and its use in human skeleton tracking.

General Terms
Skeleton tracking algorithm & Action Recognition

Keywords
Skeleton Tracking, Kinect, Pose Estimation, Arduino, Actions

1. INTRODUCTION
Mobile robots have thousands of applications, from autonomously mapping a lawn and cutting grass to urban search-and-rescue autonomous ground vehicles. One important future application would be to fight wars in place of humans: humans would fight virtually, and whatever move the human makes, the mobile robot would copy. To achieve this, the robot must be taught how to copy human actions, so this project deals with making a robot that copies human actions.

The idea is to make use of one of the most powerful capabilities of the Kinect: skeleton tracking. This feature allows a servo-driven robot to copy human actions efficiently. Natural interaction applied to the robot has an important outcome: there is no need for a physical connection between the controller and the robot. The project will be extended to implement network connectivity, so that the robot can be controlled remotely from anywhere in the world.

The project uses skeleton tracking so that the Kinect can detect the movements of the user's joints and limbs in space. The user data is mapped to servo angles, which are sent to the Arduino board controlling the servos of the robot.

The skeleton tracking feature is used to map the depth image of the human. It tracks the positions of the joints of the human body, which are provided to the computer; the computer in turn sends a pulse signal to the Arduino board for every joint, making the corresponding servo motor rotate in accordance with the pulses.

Eight servos are placed on the shoulders, elbows, hips, and knees of the robot. A servo motor is a DC motor whose rotation depends on the signal pulses applied to it. Assuming that one pulse rotates the motor through 1 degree, 90 pulses rotate it through 90 degrees, 180 pulses through 180 degrees, and so on.

The second important part of the paper is angle calculation. The skeleton information from the Kinect is stored on the computer, which runs a program used by the Arduino to calculate the angle of inclination of every joint of the human body. Each calculated angle is then converted into a pulse train for the corresponding servo motor connected to the Arduino. According to the received pulses, each servo motor rotates through the angle observed by the Kinect sensor; hence the robot copies the actions of the human skeleton.

The third important part of the project is to extend the concept over the internet, so that the robot can be operated from anywhere around the globe. To do so, the user sets the external IP address of the computer in the Arduino program; through this, the robot emulates the human's actions from anywhere on earth.

2. RELATED WORK
The project deals with making a robot that copies human actions. Microsoft's Xbox Kinect has proved useful for detecting human actions and gestures, so this paper proposes using the Kinect camera to capture human gestures and then relaying these actions to the robot, which is controlled by the Kinect and an Arduino board.

2.1 Existing Systems
Previously, depth images were recorded with the help of silhouettes, which are the contours of the body parts whose depth images are to be formed [1]. Silhouettes discard the shadow of the body and the colour of the clothes the person is wearing; they capture only the border of the body. It is, however, very difficult for a digital system to predict the motion of the body parts of an unknown person, since this type of model is based on a priori knowledge of the contours, and humans around the world differ in size, limb length and many other physical parameters, making it difficult to store all such information. Using silhouettes therefore limits the scope of depth images [1].

The two major steps leading from a captured motion to a reconstructed one are:
Marker reconstruction from 2-D marker sets to 3-D positions;
Marker tracking from one frame to the next, in 2-D and/or 3-D.
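The first of these steps, recovering a 3-D marker position from two 2-D views, can be sketched with a minimal triangulation for an idealized rectified stereo pair. The focal length and baseline below are illustrative assumptions, not values from the paper:

```python
# Minimal stereo triangulation for an idealized rectified camera pair.
# f (focal length, px) and baseline (m) are illustrative values only.

def triangulate(x_left, y_left, x_right, f=600.0, baseline=0.1):
    """Recover the 3-D position of a marker seen at pixel column
    x_left in the left image and x_right in the right image
    (same row y_left in a rectified pair)."""
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("marker must have positive disparity")
    z = f * baseline / disparity   # depth along the optical axis
    x = z * x_left / f             # lateral offset
    y = z * y_left / f             # vertical offset
    return (x, y, z)

# A marker 3 m away with a 0.1 m baseline and f = 600 px produces
# a disparity of f*baseline/z = 20 px:
x, y, z = triangulate(110.0, 40.0, 90.0)
```

Real multi-camera marker systems solve a least-squares version of this over many calibrated views; the rectified two-view case shows the core geometry.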


However, despite the fact that 2-D and 3-D tracking ensure the identification of a large number of markers from one frame to another, ambiguities, sudden accelerations or occlusions will often cause erroneous reconstructions or breaks in the tracking links. For this reason, it has proved necessary to increase the procedure's robustness by using the skeleton to drive the reconstruction and tracking process, introducing a third step, i.e. the accurate identification of each 3-D marker and a complete marker inventory in each frame. The approaches to solving these issues are addressed in the following paragraphs, starting with the presentation of the human model used and keeping in mind that the entire approach is based on constant interaction between the model and the above marker-processing tasks.

2.1.1 Skeleton model
The skeleton model is controlled by 32 degrees of freedom grouped in 9 joints in 3D space. This is a simplified version of the complete skeleton generally used; it does not include detailed hands and feet.

Fig 1: Default Skeletal Joint Locations

2.1.2 Stereo triangulation
3D markers are reconstructed from the 2D data using stereo triangulation.

2.1.3 Binocular reconstruction
After reconstructing these 3D markers in the first frame, the number of reconstructed markers is compared with the number of markers known to be carried by the subject. As all remaining processing is automatic, it is absolutely essential that all markers be identified in the first frame; any marker not present in the first frame is lost for the entire sequence. Therefore, if the number of reconstructed markers is insufficient, a second stereo matching is performed, this time also taking into account markers seen in only two views. [2]

There are three categories of techniques for tracking the image without markers (the marker-less approach). First, learning-based methods, which rely on prior probabilities for human poses and therefore assume limited motions. Second, model-free methods, which do not use any a priori knowledge and recover articulated structures automatically; however, the articulated structure is likely to change in time, for instance when a new articulation is encountered, making identification or tracking difficult. Third, model-based approaches, which fit and track a known model using image information.

2.2 Proposed Approach
The paper aims at limiting the required a priori knowledge as much as possible, while keeping the robustness of the method reasonable for most interaction applications; hence the given approach belongs to the third category [3]. Among model-based methods, a large class of approaches uses an a priori surface or volume representation of the human body, which combines both shape and motion information [4]. The corresponding models range from fine mesh models to coarser models based on generalized cylinders, ellipsoids or other geometric shapes. In order to avoid complex joint estimation of both shapes and motions, most approaches in this class assume known body dimensions. However, this strongly limits flexibility and becomes intractable for interaction systems where unknown persons are supposed to interact. A more efficient solution is to find a model that reduces the shape information; to this purpose, a skeletal model can be used. Such a model does not include any volumetric information and hence has fewer dependencies on body dimensions. In addition, limb lengths tend to follow natural biological laws, whereas human shapes vary greatly across the population. Recovering motion using skeletal models has not been widely investigated; in one approach a skeletal structure is fitted with the help of hand/feet/head tracking, but volumetric dimensions are still required for the arm and leg limbs. Given all the complications and errors in these techniques, the use of the Kinect in this project tackles the difficulties faced by previous approaches in finding a robust technique. [3]

3. KINECT & ITS WORKING
A Microsoft Kinect sensor offers high-resolution RGB and depth sensing and is becoming available for widespread use. It supports object tracking, object detection and recognition, as well as human activity analysis, hand gesture analysis and 3D mapping. Facial expression detection is widely used in human-computer interfaces. The Kinect can be used to detect and distinguish between different kinds of objects. The depth information is analysed to identify the different parts of the fingers or hands, or the entire body, in order to interpret gestures from a human standing in front of the sensor. Thus the Kinect was found to be an effective tool for target tracking and action recognition. [5]

The Kinect camera consists of an infrared (IR) projector, a colour camera, and an IR camera. The depth sensor consists of the IR projector combined with the IR camera, which is a monochrome complementary metal-oxide-semiconductor (CMOS) sensor. The IR projector is an IR laser that passes through a diffraction grating and turns into a set of IR dots. [6]

The relative geometry between the IR projector and the IR camera, as well as the projected IR dot pattern, are known. If a dot observed in an image is matched with a dot in the projector pattern, it can be reconstructed in 3D using triangulation. Because the dot pattern is relatively random, the matching between the IR image and the projector pattern can be done in a straightforward way by comparing small neighbourhoods using, for example, normalized cross-correlation. [6]

In skeletal tracking, a human body is represented by a number of joints corresponding to body parts such as the head, neck, shoulders, and arms. Each joint is represented by its 3D coordinates. The goal is to determine all the 3D parameters of these joints in real time, to allow fluent interactivity, with the limited computation resources allocated on the Xbox 360 so


as not to impact gaming performance. Rather than trying to determine the body pose directly in this high-dimensional space, Jamie Shotton and his team met the challenge by proposing per-pixel body-part recognition as an intermediate step. Shotton's team treats the segmentation of a depth image as a per-pixel classification task (no pairwise terms or conditional random field are necessary) [4]. Evaluating each pixel separately avoids a combinatorial search over the different body joints. For training data, realistic synthetic depth images of humans of many shapes and sizes, in highly varied poses sampled from a large motion-capture database, are generated. A deep randomized decision forest classifier is then trained, which avoids overfitting by using hundreds of thousands of training images. Simple, discriminative depth-comparison image features yield 3D translation invariance while maintaining high computational efficiency. [6]

4. SKELETON TRACKING ALGORITHM
The depth maps captured by the Kinect sensor are processed by a skeleton-tracking algorithm. The depth maps of the utilized dataset were acquired using the OpenNI API [7]. The OpenNI high-level skeleton-tracking module is used for detecting the performing subject and tracking a set of joints of his/her body. More specifically, the OpenNI tracker detects the position of the following set of joints in 3D space: Torso, Neck, Head, Left shoulder, Left elbow, Left wrist, Right shoulder, Right elbow, Right wrist, Left hip, Left knee, Left foot, Right hip, Right knee and Right foot. The position of joint g_i is given by the vector p_i(t) = [x y z]^T, where t denotes the frame for which the joint position is located and the origin of the orthogonal XYZ coordinate system is placed at the centre of the Kinect sensor.

4.1 Action recognition
Action recognition can be further divided into three subtypes.

4.1.1 Pose estimation
The aim of this step is to estimate a continuously updated orthogonal basis of vectors for every frame t that represents the subject's pose. The calculation is based on the fundamental consideration that the orientation of the subject's torso is the most characteristic quantity of the subject during the execution of any action, and for that reason it can be used as a reference. For pose estimation, the positions of the following three joints are taken into account: Left shoulder, Right shoulder and Right hip. These are joints around the torso area whose relative positions remain almost unchanged during the execution of any action. The motivation behind considering these three joints, instead of directly estimating the position of the Torso joint and the respective normal vector, is to reach a more accurate estimation of the subject's pose. It must be noted that the Right hip joint was preferred over the obvious Torso joint selection, so that the orthogonal basis of vectors is estimated from joints with bigger in-between distances, which is more likely to lead to more accurate pose estimation. However, no significant deviation in action recognition performance was observed when the Torso joint was used instead. [8]

4.1.2 Action Representation
For realizing efficient action recognition, an appropriate representation is required that satisfactorily handles the differences in appearance, human body type and execution of actions among individuals. For that purpose, the angles of the joints' relative positions are used in this work, which proved more discriminative than using, e.g., the joints' normalized coordinates directly. Additionally, building on the fundamental idea of the previous section, all angles are computed using the Torso joint as reference, i.e. the origin of the spherical coordinate system is placed at the Torso joint position. For computing the proposed action representation, only a subset of the supported joints is used, since the trajectories of some joints mainly contain redundant or noisy information. To this end, only the joints that correspond to the upper and lower body limbs were considered after experimental evaluation, namely Left shoulder, Left elbow, Left wrist, Right shoulder, Right elbow, Right wrist, Left knee, Left foot, Right knee and Right foot. The velocity vector is approximated by the displacement vector between two successive frames, i.e. v_i(t) = p_i(t) - p_i(t-1).

The estimated spherical angles and angular velocities for frame t constitute the frame's observation vector. Collecting the computed observation vectors for all frames of a given action segment forms the respective action observation sequence h that is used for performing HMM-based recognition, as described in the sequel. [8]

4.1.3 HMM-based recognition
A Markov model is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. This model is too restrictive to be applicable to the current problem of interest; thus the concept of the Markov model is extended to form the Hidden Markov Model (HMM). An HMM is a doubly embedded stochastic process with an underlying stochastic process that is not observable (it is hidden) and can only be observed through another set of stochastic processes that produce the sequence of observations. [12]

HMMs are employed in this work for performing action recognition, due to their suitability for modelling sequential patterns. In particular, a set of J HMMs is employed, where an individual HMM is introduced for every supported action a_j. Each HMM receives as input the action observation sequence h (as described above) and at the evaluation stage returns a posterior probability P(a_j|h), which represents the observation sequence's fitness to the particular model. The developed HMMs were implemented using the software libraries of the Hidden Markov Model Toolkit (HTK). [8]

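As a toy illustration of the evaluation stage above, the forward algorithm below computes the likelihood of a discrete observation sequence under one HMM; in a recognizer, the action whose HMM scores the input sequence highest wins. The two-state model and symbol probabilities are invented for illustration (the paper's HMMs are built with HTK and use continuous observations):

```python
def forward_likelihood(obs, pi, A, B):
    """Forward algorithm: P(obs | HMM) for a discrete-observation HMM.
    pi[i]   initial probability of state i
    A[i][j] transition probability from state i to state j
    B[i][o] probability of emitting symbol o in state i
    """
    n = len(pi)
    # Initialise with the first observation.
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    # Propagate forward through the remaining observations.
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)

# Two hidden states, two observation symbols (all values invented):
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
likelihood = forward_likelihood([0, 1, 0], pi, A, B)
```

Running one such evaluation per action model and taking the argmax over the resulting scores is the essence of the HMM-based classification step.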

Fig 2: Initialization of Kinect Camera

5. METHODOLOGY
The entire process is divided into two parts, i.e. initialization and working.

5.1 Initialization
For smooth, error-free working, the Kinect is initialized to its default mode. Initialization is done with the help of a calibration card provided by Microsoft; this card helps to align the Tx and Rx infrared sensors of the Kinect. Fig 1 indicates the default joint locations that are used; these are treated as the reference joints, and with their help the other joints are calibrated.

Fig 3: Working of stage I

5.2 Working
Initially, infrared (IR) rays are emitted from the IR transmitter of the Kinect camera. The emitted rays are received by the Kinect receiver and stored in its database. Since the system is monitoring for human joints, it waits until human joints are recognized. If any object other than skeletal joints is recognized, it discards the frame and restarts scanning with the next frame until joints are recognized. The black frame in Fig 2 indicates that neither the object nor the skeletal joints have been detected; this kind of image results in a blackened frame, and the white spots on the black frame are due to noise present in the environment. Once the joints are recognized, the Kinect uses the HMM algorithm for joint estimation and predicts future movements. The recognized joint information is converted into PWM pulses by the programmed PWM pulse generator on the Arduino board. The generated PWM pulses serve as input to the servo motors, which perform an angular tilt as per the captured movement. Since this runs in real time, the entire process is continuously repeated for each frame.

6. RESULT
The framework required for the robot can be seen in Fig 6. Along with the robot, a PCB is made to help interface the servo motors HS-311 and HS-55. The PCB interfacing for the servos is formed so that the connections remain proper and the assembly is neat and compact, as can be seen in Fig 5. Hence the Kinect camera is successfully interfaced through OpenNI and the skeleton tracked.
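The per-joint angle-to-pulse conversion performed on the Arduino can be sketched as follows. Note that hobby servos such as the HS-311 and HS-55 are positioned by the width of a repeating pulse (roughly 1000-2000 microseconds at about 50 Hz) rather than by a count of pulses; the linear mapping below follows that convention, with typical default endpoints rather than measured values:

```python
def angle_to_pulse_us(angle_deg, min_us=1000.0, max_us=2000.0):
    """Map a joint angle in [0, 180] degrees to a servo pulse width
    in microseconds (linear interpolation, the same convention as
    the Arduino Servo library's write())."""
    angle_deg = max(0.0, min(180.0, angle_deg))  # clamp to servo range
    return min_us + (max_us - min_us) * angle_deg / 180.0

# A 90-degree joint angle maps to the mid-range 1500 us pulse:
pulse = angle_to_pulse_us(90.0)
```

On the Arduino itself this mapping is typically hidden inside `Servo.write()`; the sketch only has to receive the joint angle and call it once per servo per frame.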


Fig 4: Working of stage II

7. CONCLUSION
After analysing the studies mentioned above, it can be concluded that the Kinect is an incredible piece of technology that has revolutionized the use of depth sensors in the last few years. Because of its relatively low cost, the Kinect has served as a great incentive for many projects in the most diverse fields, such as robotics and medicine, and some great results have been achieved. Throughout this project, it was possible to verify that although the information obtained by the Kinect may not be as accurate as that obtained by some other devices (e.g., laser sensors), it is accurate enough for many real-life applications, which makes the Kinect a powerful and useful device in many research fields. Thus a real-time motion-capture robot was integrated and tested using the Kinect camera. The paper proposed natural gesture-based communication with a robot. The skeleton tracking algorithm has been well explained for further work. The results are better than those of the techniques used before the Kinect camera.

Fig 5: PCB with Servo Interfaced

Learning from demonstration is the scientific field that studies one of the easier ways a human has to deal with a humanoid robot: mimicking the particular task the subject wants to see reproduced by the robot. To achieve this, a gesture recognition system is required. The paper presents a novel and cheap humanoid robot implementation along with a visual, gesture-based interface, which enables users to deal with it. Users are allowed to control the robot just by mimicking, in front of the depth camera, the gestures they want the robot to perform. This should be seen as preliminary work, where elementary interaction tools can be provided, and should be extended in many different fashions, depending on the tasks assigned to the robot. [11]

8. FUTURE SCOPE
With the progress in Kinect technology in the last decade, it can be seen as a revolutionary tool in robotics. Further modifications may be as follows:

1. Here only a few joints are tracked. The tracking algorithm can be expanded to track all the joints in the human body, giving more reliable and robust copying of human actions.

Fig 6: Robot Layout

2. As the Kinect camera used is not portable, reducing it to the size of a mobile-phone camera would be a good future development.

3. The servo motors used could be further investigated and changed to make the system more robust and natural.

4. The robot built is fixed; instead it can be made mobile. Then it will not only copy human actions but even move around like a human.

5. It is possible to implement this project over a network: the Kinect camera feeds the data into the network, the robot gets the data from the network, and thus the robot can be controlled from any corner of the world.


9. REFERENCES
[1] Agarwal, A., Triggs, B., 3D human pose from silhouettes by relevance vector regression, in Proc. IEEE International Conference on Computer Vision and Pattern Recognition, pp. 882-888, 2004.
[2] Lorna Herda, Pascal Fua, Ralf Plänkers, Skeleton-based motion capture for robust reconstruction of human motion, in Proc. Computer Animation 2000, pp. 77-83, 2000.
[3] Clement Menier, Edmond Boyer, Bruno Raffin, 3D Skeleton-Based Body Pose Recovery, in Proc. 3rd International Symposium on 3D Data Processing, Visualization and Transmission, pp. 389-396, 2006.
[4] Jamie Shotton, Toby Sharp, Alex Kipman, Andrew Fitzgibbon, Real-Time Human Pose Recognition in Parts from Single Depth Images, in Proc. Conference on Computer Vision and Pattern Recognition, pp. 1297-1304, 2011.
[5] Dnyaneshwar R. Uttaarwar, Motion Computing using Microsoft Kinect, in Proc. National Conference on Advances on Computing, 2013.
[6] Z. Zhang, Microsoft Kinect Sensor and Its Effect, in IEEE Multimedia Magazine, vol. 19, no. 2, pp. 4-10, April-June 2012.
[7] James Ashley and Jarrett Webb (Eds.), Beginning Kinect Programming with the Microsoft Kinect SDK, Apress, 2011.
[8] Georgios Th. Papadopoulos, Apostolos Axenopoulos and Petros Daras, A Compact Multi-view Descriptor for 3D Object Retrieval, in Content-Based Multimedia Indexing, pp. 115-119, 2009.
[9] Michael Margolis (Ed.), Arduino Cookbook, O'Reilly, 2011.
[10] Jack Purdum (Ed.), Beginning C for Arduino, Apress, 2011.
[11] Giuseppe Broccia, Marco Livesu and Riccardo Scateni, Gestural Interaction for Robot Motion Control, in Proc. Eurographics Italian Chapter Conference, 2011.
[12] Lawrence R. Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, in Proc. IEEE, vol. 77, no. 2, pp. 257-286, 1989.
