
Final Document: LuteKinduct

Sachi A. Williamson, Bachelor of Science, Computer Science
Laurie Murphy, Faculty Mentor
Pacific Lutheran University
CSCE 499 - Spring 2013

Abstract

The open-source availability of Microsoft Kinect development tools and the broad accessibility of the hardware enable the development of gesture-based software designed to teach basic musical conducting skills to users of any age and background. Using Microsoft's motion-sensing camera, LuteKinduct allows a user to adjust the playback speed and volume of an audio file in real time. The application is implemented in C# and uses the Kinect SDK, SoundTouchSharp, and NAudio libraries to apply a time-stretching algorithm by comparing average velocities of gestures. Finally, the project served as a model for an interdisciplinary capstone project combining computer science and the arts.

Contents
1 Introduction
1.1 What is the Kinect?

2 Requirements
2.1 Functional Objectives
2.2 Learning Objectives
2.3 Development Resources
2.4 Performance Requirements
2.5 Design Constraints
2.6 External Interface
2.7 Use Cases and Scenarios
2.7.1 Use Case 1: Load a WAV, AIFF, or MP3 file.
2.7.2 Use Case 2: Begin conducting a musical piece.
2.7.3 Use Case 3: Change the speed of the music being played.
2.7.4 Use Case 4: Change the volume of the music being played.
2.7.5 Use Case 5: Conduct a musical piece.
2.7.6 Use Case 6: Stop the musical piece at its finish.

3 Design
3.1 Mathematical Theories
3.2 UML Diagram
3.3 Detailed Use Cases
3.3.1 Case 1: Beginning Audio Playback
3.3.2 Case 2: Adjusting Tempo During Playback
3.3.3 Case 3: Adjusting Volume During Playback
3.3.4 Case 4: Stopping Audio Playback
3.4 Prototypes
3.5 External Libraries

4 Implementation
4.0.1 Implementation Changes
4.0.2 Testing

5 Future Work

6 References

List of Figures

1 The Kinect sensor for Xbox 360
2 Optimal distance and angles with the sensor
3 Velocity formula for each frame
4 Beginning audio playback
5 Adjusting tempo during playback
6 Adjusting volume during playback
7 Stopping audio playback
8 The Depth/Skeleton/Color Viewer Prototype from Kinect SDK
9 Main window prototype at end of Fall 2012
10 Final opening GUI window
11 Final Main Window

1 Introduction

1.1 What is the Kinect?

The Kinect sensor is a motion-sensing camera released by Microsoft in 2010, originally exclusively for the Xbox 360. Although similar to earlier cameras used experimentally for video games (such as the EyeToy), the Kinect arrived at a time when public interest in motion-based gaming was rapidly increasing following the release of another motion-based system, the Nintendo Wii. The camera received a Guinness World Record as the fastest-selling consumer electronics device, with eight million units sold in sixty days [1].

The Kinect SDK was announced and opened to developers in 2011, primarily for application development on Windows computers. A sensor designed solely for use with the SDK was released a year later at a price point similar to that of the Xbox sensor. This project was developed with the Xbox 360 version, which did not pose any major problems despite rumors of incompatibility. The Kinect sensor itself has an infrared projector and a monochrome CMOS (complementary metal-oxide semiconductor) sensor. A color VGA video camera sits between the two; it detects three colors and is referred to in the documentation as the RGB camera. The other two components primarily allow the sensor to capture a 3-D image of a room, while the RGB camera is used for facial recognition. The infrared projector and monochrome CMOS sensor are usually grouped together as a depth sensor and will be referred to as such throughout the paper. The video and depth sensors have a 640 x 480-pixel resolution and operate at 30 FPS. The recommended distance between the sensor and the user is around 6 feet, although a recent release supports a seated position. In the seated position, measurements are computed differently than in a standing position, but this position is not used in the project.
[1] http://en.wikipedia.org/wiki/Kinect
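As a point of reference, the following is a minimal sketch (not taken from the LuteKinduct source) of how a sensor with these streams might be initialized using the Kinect for Windows SDK 1.x; the SensorSetup class name is illustrative.

using System.Linq;
using Microsoft.Kinect;

static class SensorSetup
{
    // Finds the first connected Kinect, enables the color, depth, and
    // skeleton streams at the 640 x 480 / 30 FPS formats described above,
    // and starts the sensor.
    public static KinectSensor StartFirstSensor()
    {
        KinectSensor sensor = KinectSensor.KinectSensors
            .FirstOrDefault(s => s.Status == KinectStatus.Connected);
        if (sensor == null)
            return null;

        sensor.ColorStream.Enable(ColorImageFormat.RgbResolution640x480Fps30);
        sensor.DepthStream.Enable(DepthImageFormat.Resolution640x480Fps30);
        sensor.SkeletonStream.Enable();
        sensor.Start();
        return sensor;
    }
}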

Figure 1: The Kinect sensor for Xbox 360

2 Requirements

2.1 Functional Objectives

The creation of a gesture-based virtual music conducting application implemented in C# and the .NET Framework with Microsoft's Kinect that:

- Supports WAV, AIFF, and MP3 audio formats
- Allows tempo and volume adjustments in real time during audio playback based on the user's arm gestures
- Uses the NAudio and SoundTouch libraries, as well as a wrapper class, SoundTouchSharp, and another external class, AdvancedBufferedWaveProvider, as a means to focus efficiently on the Kinect implementation rather than on time-stretching
- Changes the tempo of the audio file by recording right-hand coordinates, calculating velocities, comparing velocities, and adjusting based on the rate of change
- Adjusts the volume by detecting the z-coordinate of the left hand against the right, then setting the volume percentage according to the y-values of the left arm
- Starts and stops audio playback according to the arm level of the user
- Displays a minimalistic initial graphical user interface (GUI) using Windows Forms, with a preview window and tilt angle adjustment
- Shows a window containing both the skeleton stream and the color stream that reflects the user during audio playback

2.2 Learning Objectives

- To gain knowledge of C#, the Kinect SDK, and the connection of open-source libraries
- To learn how to research more efficiently and effectively using online materials
- To improve problem-solving (debugging) skills and software engineering methods and processes (particularly elements of Test-Driven Development and Agile methodology)

2.3 Development Resources

Since the Kinect SDK was released by Microsoft in 2011, it has received steady improvements and enhancements that make it more accessible to the public for open-source projects. As of October 2012, Microsoft has released version 1.6 of the SDK, which offers developer tools, more exposure of the internal workings of the hardware itself, and support for Visual Studio 2012. Consequently, more print and online resources related to development with the Kinect SDK have been released.

2.4 Performance Requirements

Due to the continuous improvements being made to the Kinect SDK, it is challenging to predict precisely the runtime characteristics of the program. However, there should be minimal delay (less than a second) between gestures and audio playback, or the user will become frustrated. Additionally, the application will have a reasonable overhead due to the complexity of initializing the sensor and using the SDK, but it should perform well enough not to consume excessive memory or CPU power.

2.5 Design Constraints

The optimal environment for both lighting and visual recognition by the hardware needs to be taken into consideration. Intuitive yet distinct gestures are needed for gesture recognition to function properly. Additionally, only one user can be present, or the sensor will not be able to follow the primary user. Of course, some anomalies are unavoidable. For instance, during testing, the sensor occasionally misinterpreted a chair in the background as a user and intermittently expected it to conduct. A diagram provided by Microsoft of sufficient angles and distances from the sensor is shown below in Figure 2.

Figure 2: Optimal distance and angles with the sensor

The design constraints also depend largely on the capabilities of the Kinect SDK frameworks, as well as on the Kinect hardware itself. Similarly, it was difficult to weigh the benefits and drawbacks of using external libraries to control audio playback and a time-stretching algorithm. Scoping the gesture design was also challenging, since the goal was to retain precise measurements without resorting to machine learning or advanced algorithms. Lastly, gesture recognition inconsistency needed to be taken into account, since optimal lighting conditions are not always present and the sensor may malfunction for brief periods. This was eventually addressed by using the TransformSmoothing functionality in the Kinect SDK to reduce jitter in the joint positions.
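As an illustration, a small sketch of that smoothing follows, assuming the Kinect for Windows SDK 1.x TransformSmoothParameters API; the parameter values are examples rather than the values LuteKinduct uses.

using Microsoft.Kinect;

static class SmoothingSetup
{
    public static void EnableSmoothedSkeletons(KinectSensor sensor)
    {
        var smoothing = new TransformSmoothParameters
        {
            Smoothing = 0.5f,           // how aggressively raw positions are smoothed
            Correction = 0.5f,          // how quickly smoothing corrects toward raw data
            Prediction = 0.5f,          // amount of predictive smoothing applied
            JitterRadius = 0.05f,       // radius (meters) of movement treated as jitter
            MaxDeviationRadius = 0.04f  // cap on how far smoothed values may drift
        };

        // Enabling the skeleton stream with these parameters applies the
        // SDK's built-in jitter filtering to every tracked joint.
        sensor.SkeletonStream.Enable(smoothing);
    }
}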

2.6 External Interface

The external interface is the Kinect sensor, to which the computer is connected through a USB port. The sensor requires a separate power cord, purchased in the fall semester, since it needs more power than a USB 2.0 port can provide. The only other requirement is that the SDK and its appropriate plugins, which can be found on the developer website, be installed on the computer before use. Thankfully, integration between Windows and the Kinect is relatively seamless compared to typical third-party hardware complications.

2.7 Use Cases and Scenarios

2.7.1 Use Case 1: Load a WAV, AIFF, or MP3 file.

The user starts up the application and is given a Browse button. Once selected, the interface filters out files that are not in the specified formats and shows a window that allows the user to select a file and load it into the player.

2.7.2 Use Case 2: Begin conducting a musical piece.

The user wants to start performing a musical piece. The Kinect hardware recognizes that the joints and coordinates of both arms are in the ready position (out in front of the body at abdomen level) and informs the software. The software loads the audio file to be played, waits for the coordinates of both arm joints to be raised above abdomen level, and then plays the file back accordingly.

2.7.3 Use Case 3: Change the speed of the music being played.

The user wants to increase the tempo of the piece by moving the arms in a faster motion. The Kinect hardware recognizes that the joint positions are arriving at a quicker rate and informs the software of the change. The software recognizes the faster arrival of joint positions from the gestures and applies a time-stretching algorithm accordingly.

2.7.4 Use Case 4: Change the volume of the music being played.

The user wants to increase the volume of the piece by moving the left arm in a strictly vertical motion (raising one hand up and down). The Kinect hardware recognizes that the positions on that side of the body are moving vertically relative to the coordinates of the joints and informs the software of the change. The software recognizes the vertical changes in the arm positions and adjusts the volume of the audio playback accordingly.

2.7.5 Use Case 5: Conduct a musical piece.

The user would like to conduct a musical piece after loading an audio file. The user begins playback by raising both arms above abdomen level, then moving both arms away from the body in a U-shaped motion or similar (depending on the time signature of the musical piece). The Kinect hardware recognizes the gestures by mapping the coordinates of the wrist joints and comparing them to previous wrist positions; it sends the coordinate information to the system. The system translates the given coordinates and adjusts the volume and tempo accordingly (see Use Cases 3 and 4).

2.7.6 Use Case 6: Stop the musical piece at its finish.

The user would like to end the musical piece when the audio file has finished playing. The system announces to the user via text that the piece is nearly finished. The user finishes the piece by making a counter-clockwise circular motion and holding the wrists steady for approximately two to three seconds. The Kinect hardware watches for either the circular motion or the steady positions of the wrist joints over an elapsed time period and informs the software. The software receives the stop signal from the hardware and immediately fades playback down.

3 Design

3.1 Mathematical Theories

The mathematics behind gesture recognition and time-stretching evolved multiple times after reviewing various libraries and strategies for precision and accuracy. However, the core requirement remained the same: coordinates need to be recorded and velocities calculated. The initial math was influenced by a previous project, Kinductor. Mr. Joshua Blake provided thorough explanations of the coordinate retrieval and of calculations beyond velocities that would produce a beats-per-minute (BPM) measurement to compare against the audio player. In the end, the math was actually very simple compared to its preliminary stage. First, the Kinect sensor documents each frame by retrieving the x, y, and z coordinates as well as the time elapsed between consecutive frames. Once those are stored in a list-type object, the velocities are calculated from the formula shown below in Figure 3.

Figure 3: Velocity formula for each frame

The velocity calculation compares two consecutive frames, computes the average, and stores that value in another list. Consecutive values in that list of velocities are then compared to compute the rate-of-change ratio that adjusts the tempo. Although simple, this method of calculation works consistently for the most part and has given linear results.
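The formula in Figure 3 is not reproduced in this text; the sketch below shows one plausible reading of the calculation described above, using a two-dimensional distance-over-time velocity. The HandCoordinate struct and its field names are hypothetical stand-ins for the project's hand coordinate struct.

using System;

struct HandCoordinate
{
    public float X;      // horizontal position of the right hand
    public float Y;      // vertical position of the right hand
    public double Time;  // timestamp of the frame, in seconds
}

static class VelocityMath
{
    // Velocity between two consecutive frames: distance travelled divided
    // by the time elapsed between the frames.
    public static double FrameVelocity(HandCoordinate a, HandCoordinate b)
    {
        double dx = b.X - a.X;
        double dy = b.Y - a.Y;
        double dt = b.Time - a.Time;
        return Math.Sqrt(dx * dx + dy * dy) / dt;
    }

    // Rate of change between two average velocities; this ratio is what the
    // report describes feeding into the tempo adjustment.
    public static double RateOfChange(double previousAverage, double currentAverage)
    {
        return currentAverage / previousAverage;
    }
}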


3.2 UML Diagram

LuteKinductBackend performs the main calculations and analysis of the data, such as calculating average velocities, storing coordinates, and adjusting audio playback. Both the NAudio and SoundTouch (SoundTouchSharp) libraries are connected to the backend. The class that reads the coordinates directly from the sensor is the SkeletonStream class, which draws the skeleton stream in the window during playback and records the coordinates. Lastly, the ColorStream class was initially used as a prototype but was retained to offer the user a tilt-angle window. The UML diagram is shown below.



3.3 Detailed Use Cases

3.3.1 Case 1: Beginning Audio Playback

To begin audio playback, after the user has selected the desired audio file from the initial window, the user is brought to the main window. When the user raises both arms above the shoulders, the Kinect sensor raises a boolean flag to the SkeletonStream class indicating that the arms are above the shoulders. When both arms are brought below the shoulders but above the hip joint, the hardware raises another flag to start, while simultaneously retrieving the x and y coordinates from the upward-then-downward motion. LuteKinductBackend then records the initial velocity from the coordinates found by the SkeletonStream and starts playback according to the rate of change in the velocities. The sequence diagram can be seen below in Figure 4.

Figure 4: Beginning audio playback
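For illustration, a rough sketch of how the arms-above-the-shoulders condition might be checked against Kinect SDK 1.x skeleton data follows; the joint names come from the SDK, while the class and method names are hypothetical rather than LuteKinduct's.

using Microsoft.Kinect;

static class StartGesture
{
    // True when both hands are higher (larger y) than the shoulder center,
    // which is the start condition described in this use case.
    public static bool ArmsAboveShoulders(Skeleton skeleton)
    {
        SkeletonPoint leftHand = skeleton.Joints[JointType.HandLeft].Position;
        SkeletonPoint rightHand = skeleton.Joints[JointType.HandRight].Position;
        SkeletonPoint shoulderCenter = skeleton.Joints[JointType.ShoulderCenter].Position;

        return leftHand.Y > shoulderCenter.Y && rightHand.Y > shoulderCenter.Y;
    }
}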


3.3.2 Case 2: Adjusting Tempo During Playback

When the user moves their arms, the sensor retrieves the x, y, and time values every frame, up to twenty-eight frames in total. These values are sent from the SkeletonStream class to LuteKinductBackend and recorded in an array of hand coordinate structs, which hold the x and y positions as well as the time. Once the array is full (twenty-eight frames), LuteKinductBackend calculates the velocities for each frame. The application then compares the velocities of every fourteenth frame (nine and six frames were also tried during testing, but fourteen gave optimal results). After averaging the velocities, the LuteKinductBackend class compares the current average velocity to the previous value (in an array of size two) and assigns the rate of change between the two. The rate-of-change value is then passed to the tempo in SoundTouch, which adjusts the playback of the audio file appropriately. Figure 5 shows this sequence diagram.

Figure 5: Adjusting tempo during playback
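A hedged sketch of this flow follows, reusing the hypothetical HandCoordinate and VelocityMath types sketched in Section 3.1. The applyTempoRatio delegate stands in for whatever SoundTouchSharp call the project actually uses, since the exact method name is not given in this report.

using System;

class TempoAdjuster
{
    private const int FrameWindow = 28;  // frames buffered before adjusting the tempo
    private readonly HandCoordinate[] frames = new HandCoordinate[FrameWindow];
    private int frameCount = 0;
    private double previousAverage = 0;

    public void AddFrame(HandCoordinate frame, Action<double> applyTempoRatio)
    {
        frames[frameCount++] = frame;
        if (frameCount < FrameWindow)
            return;

        // Average the velocities of consecutive frame pairs in the window.
        double sum = 0;
        for (int i = 1; i < FrameWindow; i++)
            sum += VelocityMath.FrameVelocity(frames[i - 1], frames[i]);
        double average = sum / (FrameWindow - 1);

        // The ratio between consecutive average velocities drives the tempo change.
        if (previousAverage > 0)
            applyTempoRatio(average / previousAverage);

        previousAverage = average;
        frameCount = 0; // start filling the next window of frames
    }
}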


3.3.3 Case 3: Adjusting Volume During Playback

If the user wants to change the volume of the audio file during playback, they first extend their left arm straight out in front of the body, past the right arm but parallel to it. The sensor raises a flag indicating that the left arm is extended further than the right. When the user moves the arm upwards or downwards (above the hip joint), the hardware retrieves the y values through the SkeletonStream class, and the application adjusts the volume in NAudio from those y values in LuteKinductBackend. Figure 6 below shows the detailed use case.

Figure 6: Adjusting volume during playback
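The sketch below illustrates one way such a mapping could look with Kinect SDK 1.x joints and an NAudio volume property; scaling the left hand's height between the hip and shoulder is an assumption, and the method is not LuteKinduct's actual code.

using System;
using Microsoft.Kinect;
using NAudio.Wave;

static class VolumeGesture
{
    public static void Apply(Skeleton skeleton, AudioFileReader reader)
    {
        SkeletonPoint leftHand = skeleton.Joints[JointType.HandLeft].Position;
        SkeletonPoint rightHand = skeleton.Joints[JointType.HandRight].Position;
        SkeletonPoint hip = skeleton.Joints[JointType.HipCenter].Position;
        SkeletonPoint shoulder = skeleton.Joints[JointType.ShoulderCenter].Position;

        // Only adjust volume when the left hand is extended past the right
        // (smaller z means closer to the sensor) and is above the hips.
        if (leftHand.Z < rightHand.Z && leftHand.Y > hip.Y)
        {
            // Map the hand's height between hip and shoulder onto 0.0-1.0.
            float range = shoulder.Y - hip.Y;
            float volume = (leftHand.Y - hip.Y) / range;
            reader.Volume = Math.Max(0f, Math.Min(1f, volume));
        }
    }
}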


3.3.4 Case 4: Stopping Audio Playback

When the user drops both arms below hip level, the sensor flags the SkeletonStream that the y values have fallen below a certain threshold. LuteKinductBackend then flushes the libraries, clears the arrays, and frees memory. The audio playback then stops, and the sensor once again waits for both arms to be raised above shoulder level to begin again. This sequence diagram can be seen in Figure 7.

Figure 7: Stopping audio playback
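A minimal sketch of the stop condition, assuming Kinect SDK 1.x joints and an NAudio WaveOut device; the flushBuffers delegate is a placeholder for the cleanup LuteKinductBackend performs, which is described only in prose above.

using System;
using Microsoft.Kinect;
using NAudio.Wave;

static class StopGesture
{
    public static void CheckForStop(Skeleton skeleton, WaveOut waveOut, Action flushBuffers)
    {
        SkeletonPoint leftHand = skeleton.Joints[JointType.HandLeft].Position;
        SkeletonPoint rightHand = skeleton.Joints[JointType.HandRight].Position;
        SkeletonPoint hip = skeleton.Joints[JointType.HipCenter].Position;

        // Both hands below the hip joint is the stop condition described above.
        if (leftHand.Y < hip.Y && rightHand.Y < hip.Y)
        {
            waveOut.Stop();  // halt NAudio playback
            flushBuffers();  // placeholder for clearing buffers and arrays
        }
    }
}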

3.4 Prototypes

In the fall and J-Term semesters, prototypes were created to become familiar with both C# and the Kinect SDK. Experimentation with the Kinect SDK was especially important for learning the syntax and for building applications using the library, but SoundPlayer (ultimately not used), SoundTouchSharp, NAudio, and the .NET Framework were also tested. Preliminary GUIs were also created that retained many of the same elements as the final version, such as the skeleton stream, color stream, and tilt angle. Overall, prototyping was an integral part of the design process, both for fleshing out how the final product should look and for understanding how the many complex parts of the application function together. Two of the prototypes can be seen below in Figures 8 and 9. Figure 8 is a prototype of each of the sensor views, of which the skeleton and color streams were used in the last version of the project. Figure 9 shows the first window the user sees, which imports an audio file, enables tempo and volume adjustment, and tilts the angle of the sensor.


Figure 8: The Depth/Skeleton/Color Viewer Prototype from Kinect SDK

Figure 9: Main window prototype at end of Fall 2012

3.5 External Libraries

The external libraries ultimately chosen for the project were found at the beginning of the fall semester. However, there was little documentation on how to import the packages and use the libraries in classes. After compiling the SoundTouchSharp class file into a .dll for use with the SoundTouch library, including the three libraries seemed rather simple. Of course, there were many complications, such as odd errors over x86 versus x64 compilation and correctly using the most current data structures and classes in the NAudio library. The greatest difficulty was connecting SoundTouchSharp with NAudio. A previous implementation existed in the PracticeSharp application, for which SoundTouchSharp was initially designed, but taking apart the code line by line to find the necessary pieces was quite a challenge. In the end, the SoundTouchSharp library was correctly connected with NAudio after several hours of commenting and testing. NAudio was used for the audio playback itself and for volume adjustment, while SoundTouchSharp, in conjunction with the AdvancedBufferedWaveProvider class (also created for the PracticeSharp application), adjusted samples and played them through NAudio.
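To make the shape of that pipeline concrete, the sketch below uses NAudio's AudioFileReader, BufferedWaveProvider, and WaveOut classes, with a processThroughSoundTouch delegate standing in for the SoundTouchSharp/AdvancedBufferedWaveProvider stage, whose exact API is not documented in this report; it is an assumption-laden illustration rather than the project's implementation.

using System;
using System.Threading;
using NAudio.Wave;

class PlaybackPipeline
{
    // Reads the audio file in small blocks, passes each block through a
    // time-stretching stage (placeholder), and queues the result for
    // playback through NAudio.
    public void Play(string path, Func<byte[], int, byte[]> processThroughSoundTouch)
    {
        var reader = new AudioFileReader(path);                      // decodes WAV/AIFF/MP3
        var buffered = new BufferedWaveProvider(reader.WaveFormat);  // queue of processed audio
        var waveOut = new WaveOut();
        waveOut.Init(buffered);
        waveOut.Play();

        var block = new byte[reader.WaveFormat.AverageBytesPerSecond / 10]; // ~100 ms of audio
        int read;
        while ((read = reader.Read(block, 0, block.Length)) > 0)
        {
            byte[] stretched = processThroughSoundTouch(block, read);
            buffered.AddSamples(stretched, 0, stretched.Length);

            // Let playback drain before queueing more, so the buffer never overflows.
            while (buffered.BufferedDuration.TotalSeconds > 1.0)
                Thread.Sleep(50);
        }
    }
}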

4 Implementation

After researching and prototyping, implementation was not as brain-numbing as rumored, even though the connection between SoundTouchSharp and NAudio posed its own challenges. Each component of the project was implemented and tested as separately as possible to avoid ambiguity in the debugging process, starting with the initial window and its OpenFileDialog functionality. After that GUI worked, I integrated it with the ColorStream tilt-angle window. Following that step, I created the SkeletonStream and tested each function before moving on to the next. LuteKinductBackend was connected to the SkeletonStream, and I then connected NAudio and SoundTouchSharp together so they could be attached as a whole to the LuteKinductBackend class. Similarly, before the components were added to the main application, each was further prototyped against cases closer to the final project than the prototypes created in the fall semester, such as testing the connectivity of SoundTouchSharp and NAudio. Using methods from Extreme Programming (to an extent), story cards were created and code refactoring took place roughly every two weeks in an attempt to maintain clean code. If the project is further refined, more use of Extreme Programming and Agile methodology will be made. The main issues were connecting SoundTouch and NAudio, as well as retrieving precise and accurate coordinate readings for the velocities. SoundTouch and NAudio were finally connected by sorting through all of the code in the PracticeSharp application and analyzing the function each line performed. Sufficiently precise readings were found through rigorous testing of variable frame counts, as well as by printing values to files for post-processing.

4.0.1 Implementation Changes

Initially, in the fall semester, it was proposed to use a beats-per-minute calculation derived from the velocities to adjust the speed of audio playback. Instead, after reading about and testing the SoundTouchSharp class, it became clear that the application does not actually need the robustness of a complicated time-stretching package, just a simple one that can speed up and slow down audio playback. Furthermore, the rate of change can be computed by comparing velocities and then passed into the SoundTouchSharp class directly, which eliminated many unnecessary steps. The final images of the application are shown below in Figures 10 and 11, with the main window showing the user above the image of the final version of the initial window.


Figure 10: Final opening GUI window

4.0.2 Testing

Although time constraints allowed only limited testing, some exception handling was addressed, such as an invalid file type entered by the user in the text area. Issues arising from not flushing the SoundTouch buffers when the user stops playback and restarts quickly were also handled. Lastly, time-stretching modifications were made for precision, such as comparing a certain number of frames over others (comparing every three frames versus every nine frames). More testing would be necessary, especially around the inconsistency of coordinate recognition.
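As an example of the kind of input validation involved, the sketch below rejects files whose extension is outside the supported formats; the class, method, and extension list are illustrative assumptions, not code from the project.

using System;
using System.IO;

static class FileValidation
{
    // Extensions accepted by the loader; the exact list the project checks
    // is an assumption here.
    private static readonly string[] Supported = { ".wav", ".aif", ".aiff", ".mp3" };

    public static bool IsSupportedAudioFile(string path)
    {
        if (string.IsNullOrEmpty(path) || !File.Exists(path))
            return false;

        string extension = Path.GetExtension(path).ToLowerInvariant();
        return Array.IndexOf(Supported, extension) >= 0;
    }
}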


Figure 11: Final Main Window

5 Future Work

The program successfully adjusts the speed and volume of an audio file in real time based on the user's left- and right-hand gestures picked up by the Kinect sensor. More precise adjustments to the time-stretching library would likely be needed, as there seems to be some delay between calculating velocities and adjusting the tempo, probably on the library side. Additionally, more work needs to be done to capture the y-position of the left-hand arm movement more precisely, especially when the user tries to adjust the volume quickly. Overall, I was very satisfied with the final project, and it was certainly a significant learning experience in designing, prototyping, and completing an application that is (nearly) uniquely of my own creation.


6 References

J. Ashley and J. Webb, Beginning Kinect Programming with the Microsoft Kinect SDK, 1st Edition, New York: Apress, 2012.

S. Crawford. (2012, December 2). How Microsoft Kinect Works. [Online]. Available: http://electronics.howstuffworks.com/microsoft-kinect2.htm

D. Catuhe, Programming with the Kinect for Windows Software Development Kit, Redmond: Microsoft Press, 2012.

S. Kean, J. Hall, and P. Perry, Meet the Kinect: An Introduction to Programming Natural User Interfaces, New York: Apress, 2011.

J. Liberty, Programming C#, Sebastopol: O'Reilly and Associates, Inc., 2001.

M. Heath. (2012, October 12). NAudio. [Online]. Available: http://naudio.codeplex.com/

Microsoft Corporation. (2012, October 10). Kinect for Windows SDK Documentation. [Online]. Available: http://msdn.microsoft.com/en-us/library/hh855347.aspx

Microsoft Corporation. (2012, October 12). Human Interface Guidelines. [Online]. Available: http://www.microsoft.com/en-us/kinectforwindows/develop/learn.aspx

Microsoft Corporation. (2012, December 7). SoundPlayer Class (System.Media). [Online]. Available: http://msdn.microsoft.com/en-us/library/system.media.soundplayer.aspx

Microsoft Corporation. (2012, December 6). Coordinate Spaces. [Online]. Available: http://msdn.microsoft.com/en-us/library/hh973078.aspx

O. Parviainen. (2012, October 10). SoundTouch Audio Processing Library: SoundStretch Audio Processing Utility. [Online]. Available: http://www.surina.net/soundtouch/soundstretch.html

Y. Naveh. (2012, October 11). PracticeSharp. [Online]. Available: http://code.google.com/p/practicesharp/

Additional thanks to Prof. Laurie Murphy, Dr. David Wolff, Dr. Sean O'Neill, Dr. Edwin Powell, Joshua Blake, Yuval Naveh, and the contributors on Stack Overflow.

