
Swiping with Luminophonics

Shern Shiou Tan, Tomas Henrique Bode Maul


School of Computer Science, The University of Nottingham Malaysia Campus, Jalan Broga, 43500 Semenyih, Selangor Darul Ehsan, Malaysia

Neil Russel Mennie, Peter Mitchell


School of Psychology, The University of Nottingham Malaysia Campus, Jalan Broga, 43500 Semenyih, Selangor Darul Ehsan, Malaysia

Abstract—Luminophonics is a system that aims to maximize the cross-modality conversion of information, specifically from the visual to the auditory modality, with the motivation of developing better assistive technology for the visually impaired through image sonification techniques. The project aims to research and develop generic and highly configurable components concerned with different image processing techniques, attention mechanisms, orchestration approaches and psychological constraints. The swiping method introduced in this paper combines several techniques in order to explicitly convert the colour, size and position of objects. Preliminary tests suggest that the approach is valid and deserves further investigation.

Index Terms—Image Processing, Computer Vision, Auditory Display, Image Sonification

I. INTRODUCTION

In 2003, the World Health Organization reported that there were about 314 million visually impaired people worldwide, 45 million of whom were completely blind [6]. This fact constitutes the core motivation behind Luminophonics, a solution that aims to help the blind regain their ability to interpret the visual world through visual-to-auditory sensory substitution (or image sonification). Because this is not a new area of research, we propose to address current limitations, mainly in the sense of maximizing the information transfer from one modality (i.e. visual) to the other (i.e. auditory). Probably the earliest successful sensory substitution solution for the blind is Braille [4]. Image sonification has been a research area for several years now. Some of the older solutions include vOICe [5] and SonART [1], whereas one of the most recent is SeeColor [2]. Although these and related solutions have been shown to be successful to some degree, they tend to exhibit some or all of the following weaknesses: information loss, cacophony (uninterpretability), lack of configurability and usability, and steep learning curves. Luminophonics aims to find solutions to these weaknesses, because they hinder the application of image sonification in real-world situations.

The Luminophonics project covers experimental methods for testing different mappings between visual and auditory properties. It is also concerned with studying aspects of human perception (e.g. parallel processing limits) in order to address the conversion-maximization issue in a realistic manner. The main low-level difference between the work reported in this paper and other methods is the use of top-down swiping in conjunction with an emphasis on colour conversion (an emphasis shared with SeeColor). The rationale for the swiping method is to increase the amount of information converted per unit of time without sacrificing interpretability (i.e. without creating cacophony).

II. LUMINOPHONICS & CROSS-MODALITY CONVERSIONS

The core question of cross-modality conversion is determining how much information is preserved after the conversion. Apart from representation issues (e.g. dimensionality, statistical and compositional structure, and so on) and the determination of what is relevant or not, cross-modality conversions also suffer from noise, missing data and related issues. The basic concept of Luminophonics is to develop a highly configurable and customizable platform that supports sandboxing and generic components for the audio and vision stages. Through several experiments, involving different visual-to-auditory mappings, we hope to discover the most effective set of methods, yielding the best information preservation across modalities. The definition of good information preservation includes not only a rich and correct pairing between visual and auditory properties, but also ease of use, ease of learning, informational relevance, and pleasantness (i.e. aesthetic value).

The generic platform (Figure 1) consists of several modules, each dedicated to its own task, from image segmentation to sound synthesis. As Figure 1 shows, an input image is passed to the Image Segmentation module and then to the Colour Heuristic Model. Both modules send their outputs to the Decision Module before the Sound Synthesizer produces the result of the conversion. With a generic platform, different methods can be rapidly developed to prove or disprove hypotheses regarding information preservation, perceptual limits, usability, and so on. In order to conduct preliminary comparisons between methods, a standard measure of information conversion will be proposed along with statistical results from user experiments.
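As an illustration of the intended modularity, the following Python sketch shows how the four blocks of Figure 1 might be wired together. The class names, method signatures and the Blob record are hypothetical; the paper does not specify the module interfaces.

```python
# Minimal sketch of the generic platform in Figure 1 (hypothetical class
# and method names; the real module interfaces are not specified here).
from dataclasses import dataclass


@dataclass
class Blob:
    x: int           # left edge, in pixels
    y: int           # top edge, in pixels
    width: int
    height: int
    mean_hsl: tuple  # (hue, saturation, lightness) mean of the blob's pixels
    colour: str = ""  # filled in by the colour model


class ImageSegmentation:
    def run(self, image):
        """Return a list of Blob objects extracted from the input image."""
        raise NotImplementedError


class ColourHeuristicModel:
    def run(self, image, blobs):
        """Label each blob with one of the ten colour categories."""
        raise NotImplementedError


class DecisionModule:
    def run(self, blobs):
        """Order blobs into a swiping-compatible buffer (top to bottom, left/right)."""
        raise NotImplementedError


class SoundSynthesizer:
    def run(self, buffered_blobs):
        """Render the buffered blobs into a two-channel audio wave."""
        raise NotImplementedError


def convert(image, segmenter, colour_model, decision, synthesizer):
    # Image -> Image Segmentation -> Colour Heuristic Model
    #       -> Decision Module -> Sound Synthesizer (Figure 1).
    blobs = segmenter.run(image)
    blobs = colour_model.run(image, blobs)
    buffered = decision.run(blobs)
    return synthesizer.run(buffered)
```

Because each stage hides behind a small interface, individual modules can be swapped out independently, which is what enables the sandboxing and rapid experimentation described above.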


Fig. 1. Generic Platform Process Flow

III. SWIPING METHOD

Part of the inspiration for the swiping method is derived from human visual attention. Even though the human eye provides the brain with a wide field of view, visual perception is mostly limited to a focused subset of the total field. Similarly to human visual attention, which continuously and dynamically repositions and reconfigures itself in order to process relevant information, the swiping method dynamically repositions a horizontal attention band in order to gradually transfer information for sequential processing.

In the swiping method, an image is split into left and right halves for image segmentation purposes. From the segmentation results, blobs are categorized and stored from top to bottom in a buffer. From the buffer, the blobs are processed according to their properties to form a two-channel audio output. Blobs are converted into sound band by band, in a top-down sequence. Each intra-band and inter-band delay is configurable so that each user can optimize the recognition of images according to their individual (perceptual) differences; a sketch of this scheduling is given below.

The main advantage of generating a soundscape (auditory mapping) by sequentially swiping bands from top to bottom is that Y-axis information is clearly represented by time delays. The use of time delays helps to improve human sound localization along the Y-axis. One might say that the loss of spatial representational freedom (resolution and concurrency), and the resulting difficulties in object localization, constitute the central problems or limitations of visual-to-auditory sensory substitution. In an image, the data comes in two dimensions, and stereoscopy enables a third dimension (i.e. depth information). Although swiping has been used before (both vertically and horizontally) in order to address this spatial issue, to the best of our knowledge this is the first time that it has been combined with intermediate-level image processing and an emphasis on colour processing.
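The band scheduling described above can be summarized in a short sketch. The helper below is illustrative only: the band height, delay values and blob attribute names are assumptions, and the real prototype derives its blobs from the segmentation stage described in Section IV.

```python
# Sketch of the top-down swiping schedule (assumed helper names; delays and
# band height are user-configurable in the real system).
def schedule_swipe(blobs, image_width, band_height=40,
                   inter_band_delay=0.5, intra_band_delay=0.1):
    """Return (onset_time, channels, blob) tuples, swiped top to bottom.

    blobs: objects with .x, .y and .width attributes (pixel coordinates).
    channels: subset of {"left", "right"} depending on which half the blob spans.
    """
    mid = image_width / 2
    events = []

    # Group blobs into horizontal bands by their top edge.
    bands = {}
    for blob in blobs:
        bands.setdefault(blob.y // band_height, []).append(blob)

    t = 0.0
    for band_index in sorted(bands):                 # top band first
        for blob in sorted(bands[band_index], key=lambda b: b.x):
            channels = set()
            if blob.x < mid:
                channels.add("left")
            if blob.x + blob.width > mid:
                channels.add("right")
            events.append((t, channels, blob))
            t += intra_band_delay                    # stagger blobs inside a band
        t += inter_band_delay                        # pause before the next band
    return events
```

The inter-band delay is what carries the Y-axis information, while the channel set carries the X-axis information; both parameters are exposed so that users can tune them to their own perceptual limits.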

IV. IMAGE SEGMENTATION

To obtain image blobs and their properties, Luminophonics applies an image segmentation technique that works in linear time and simultaneously labels connected components and their contours [3]. This technique not only provides fast, real-time segmentation of simple images but also yields blob properties such as size (width and height) and location (in pixel coordinates).

Fig. 2. Process of Segmentation (panels: Input, Binary, K-means, Output)

In this method, the input images are converted into binary images for the purpose of image segmentation using a contour-tracing technique. To simplify the image further, the K-means algorithm is used to convert the binary image into multiple segments. After five iterations of the K-means algorithm (refer to Figure 2), noise and false detections are greatly reduced, which increases the accuracy of the final segmentation. In order to preserve the colour information for later conversion, the input images are kept in pairs with their binary counterparts in a memory buffer. The original colour images are stored for colour extraction, while the binary images are used for image segmentation.

Two regions of interest (ROIs), namely the left region and the right region, are extracted from the binary image by splitting it into two parts of equal size. Splitting the image not only speeds up the process of image segmentation but also caters to the two-channel stereo audio aspect of the solution. Two different audio outputs are produced, one for each ear. Assuming that earphones are being used, the left ear can only hear the objects that exist in the left ROI, while the right ear can only hear the sounds of objects in the right ROI. If an object exists on both sides, its corresponding audio properties can be heard in both ears. This helps the user to determine the X-axis position of objects. A sketch of this stage is given below.
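The sketch below illustrates one plausible reading of this stage in Python, with OpenCV's K-means and connected-component routines standing in for the linear-time contour-tracing labeller of [3]; the number of clusters, the Otsu thresholding step and the returned blob fields are assumptions.

```python
# Rough sketch of the segmentation stage; OpenCV routines stand in for the
# contour-tracing labeller of [3], and the cluster count, thresholding and
# returned blob fields are assumptions rather than the paper's exact pipeline.
import cv2
import numpy as np


def segment(image_bgr, k=4, kmeans_iterations=5):
    """Return per-blob stats for the left and right halves of the image."""
    # Simplify the colour image with K-means, run for a small fixed number
    # of iterations (five in the paper), before thresholding to binary.
    pixels = image_bgr.reshape(-1, 3).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_MAX_ITER, kmeans_iterations, 1.0)
    _, labels, centres = cv2.kmeans(pixels, k, None, criteria, 1,
                                    cv2.KMEANS_RANDOM_CENTERS)
    simplified = centres[labels.flatten()].reshape(image_bgr.shape).astype(np.uint8)

    gray = cv2.cvtColor(simplified, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)

    # Split into left and right ROIs and label connected components in each.
    h, w = binary.shape
    halves = {"left": binary[:, : w // 2], "right": binary[:, w // 2:]}
    blobs = {"left": [], "right": []}
    for side, roi in halves.items():
        roi = np.ascontiguousarray(roi)
        n, _, stats, _ = cv2.connectedComponentsWithStats(roi)
        for i in range(1, n):                        # label 0 is the background
            x, y, bw, bh, area = stats[i]
            # x is relative to the half-image (ROI), not the full frame.
            blobs[side].append({"x": int(x), "y": int(y), "width": int(bw),
                                "height": int(bh), "area": int(area)})
    return blobs
```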

From the image segmentation stage, a set of blobs and their properties is produced in a linear format. Referring to Figure 2, the final output is shown in the bottom-right panel, with the red boxes illustrating the blobs to be classified. After this, a Heuristic Colour Model classifies the blobs based on their colours. The blobs are then arranged in a buffer in a swiping-compatible format before they are finally converted into an audio wave.

V. HEURISTIC COLOUR MODEL

CCD (charge-coupled device) and CMOS (complementary metal-oxide-semiconductor) camera sensors vary in sensitivity and colour management models. The subjective interpretation of colour also varies between individuals. Hence, a standard heuristic colour model is needed for each individual camera to deduce standard colour properties based on user settings. By assessing how users perceive colour, a colour model (based on flexible thresholds) can be created for particular camera/user pairs. The Heuristic Colour Model (HCM) is based on the HSL colour model, which consists of a double symmetrical cone with a true black point at the bottom and white at the other end. H represents hue, S represents saturation and L represents lightness. To determine the mapping of a pixel value, HCM first tests the pixel's saturation and then, based on this, decides whether to test its hue or lightness.

Fig. 3. Heuristic Colour Model Decision Chart

HCM is a simple and intuitive method for determining whether a particular pixel should be classified as a colour or a grayscale pixel. For a colour pixel, the saturation must lie between the upper and lower thresholds of the saturation scale. For a grayscale pixel, the saturation is either above the upper threshold or below the lower threshold. Refer to Figure 3 for a diagram of HCM's decision logic; an illustrative sketch follows below.
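The following sketch illustrates that decision logic. The overall rule (saturation first, then hue or lightness) follows the description above and Figure 3, but the specific threshold values and hue boundaries are assumed; in the actual system they are calibrated per camera/user pair.

```python
# Illustrative sketch of the Heuristic Colour Model decision logic in
# Figure 3. The threshold values and hue boundaries are assumptions; the
# paper only fixes the overall rule (saturation first, then hue or lightness).
HUE_BINS = [            # (upper hue bound in degrees, colour name) - assumed split
    (15, "Red"), (45, "Orange"), (70, "Yellow"), (160, "Green"),
    (250, "Blue"), (275, "Indigo"), (330, "Violet"), (360, "Red"),
]


def classify_pixel(h, s, l, s_low=0.15, s_high=0.95, l_low=0.2, l_high=0.8):
    """Map an HSL value (h in degrees, s and l in [0, 1]) to a colour name."""
    # Saturation is tested first: values between the two thresholds are
    # treated as colour pixels, everything else as grayscale, following the
    # description of the decision chart above.
    if s_low <= s <= s_high:
        for upper, name in HUE_BINS:
            if h <= upper:
                return name
        return "Red"
    # Grayscale branch: lightness decides between Black, Gray and White.
    if l < l_low:
        return "Black"
    if l > l_high:
        return "White"
    return "Gray"
```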

VI. VISUAL TO AUDIO MAPPING

The primary visual properties being investigated in this early phase of the Luminophonics project are object colour, size and location. These properties are converted and synthesized dynamically into audio waves by manipulating different audio properties. It is crucial to understand and exploit the full range of both the audio and visual domains to produce a good visual-to-audio mapping [7]. An intuitive mapping of visual to auditory properties not only helps users learn how to use the technology more rapidly and effectively but also maximizes the information preservation across modalities.

A. Colour

Blob pixels, as captured by standard imaging sensors, are by default Red Green Blue (RGB) values. These values are converted into the Hue Saturation Lightness (HSL) colour space. Blobs are then categorized into different colours through the Heuristic Colour Model by computing the mean of the blob pixels. In the first Luminophonics prototype reported here, blobs are categorized into 10 colours, namely: Red, Orange, Yellow, Green, Blue, Violet, Indigo, Black, White and Gray. In a manner analogous to the SeeColor approach, the 10 colours are then mapped to 10 different timbres (or musical instruments). Besides categorizing blobs into distinct colours, the mean HSL value of a blob provides further auditory variation through the lightness value, which affects the frequency (or pitch) of each timbre (or instrument).

1) Timbre: Each colour is mapped to a different and distinctive timbre. In order to satisfy the requirements of timbre distinctiveness, usability (e.g. ease of learning) and aesthetic satisfaction, as with the SeeColor approach, we have chosen different musical instruments to represent different classes of timbre. Table I depicts the particular colour-to-timbre mapping currently being used. This mapping is an obvious target for configurability, seeing that users are likely to have their own aesthetic preferences.

TABLE I
COLOUR MAPPING TABLE

Colour    Instrument
Red       Saxophone
Orange    Cello
Yellow    Harmonica
Green     Piano
Blue      Horn
Indigo    Guitar
Violet    Trumpet
White     Xylophone
Gray      Flute
Black     Violin

2) Frequency: Although there are 10 colours matched with 10 different musical instruments, the lightness value of a blob can be encoded in the auditory signal by affecting the frequency (pitch) of the musical instrument, thus further expanding the range of sounds that can be experienced by the user. This encoding allows users to differentiate two blobs with the same colour but different lightness values.
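A compact sketch of the colour-to-timbre and lightness-to-pitch mapping is given below. The instrument table follows Table I; the pitch formula (lightness spread over an assumed two-octave range) is illustrative and not taken from the paper.

```python
# Sketch of the colour-to-timbre and lightness-to-pitch mapping of Section VI-A.
# Table I fixes the instrument per colour; the pitch formula below (lightness
# shifted around middle C) is an assumption, not taken from the paper.
COLOUR_TO_INSTRUMENT = {
    "Red": "Saxophone", "Orange": "Cello", "Yellow": "Harmonica",
    "Green": "Piano", "Blue": "Horn", "Indigo": "Guitar",
    "Violet": "Trumpet", "White": "Xylophone", "Gray": "Flute",
    "Black": "Violin",
}


def blob_to_note(colour, lightness, base_midi=60, span=24):
    """Return (instrument, midi_note) for a blob.

    lightness is in [0, 1]; lighter blobs are mapped to higher pitches
    across an assumed two-octave span centred on base_midi.
    """
    instrument = COLOUR_TO_INSTRUMENT[colour]
    midi_note = int(round(base_midi - span / 2 + lightness * span))
    return instrument, midi_note
```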

B. Location

The 2D coordinates of each blob are extracted by the contour segmentation technique in x and y pixel units.

1) X-axis: The x-coordinate of a blob is encoded through stereophonic sound. The x-axis is represented by three regions: left, right and both sides. Before audio conversion, Luminophonics determines in which regions blobs reside, and audio synthesis is based on the region a blob is located in. For example, if a blob falls entirely in the left region, sound is synthesized only in the left audio channel. The converse applies to blobs falling entirely in the right region. If a blob falls across both regions, sound is synthesized in both audio channels.

2) Y-axis: The y-coordinate of a blob is implicitly mapped by the time delay of each band. One of the main purposes of the swiping method is to preserve Y-axis information. Without a surround-sound system or a highly trained ear, the y-coordinate of an object is very hard to represent. In the swiping method, the y-coordinate of a blob is associated with time delay in an intuitive manner: the further down a blob is in the image, the longer it takes for its corresponding sound to be synthesized. Thus, users can use this temporal information to create a mental image of the relative positioning of blobs along the y-axis.

C. Size

In the current implementation of Luminophonics, the size (or area) of a blob is encoded in the volume of the instrument being played. In other words, the volume of a particular instrument is directly proportional to the size of the blob generating it. This visual-to-auditory association, apart from allowing users to infer the size of objects, also allows them to infer their horizontal skewness. If a blob is skewed to the left, the area of the part of the blob in the left region is larger than the area in the right region. Hence, the volume of the corresponding audio in the left channel is higher than the volume of the same audio in the right channel.
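Putting the location and size encodings together, the sketch below derives per-blob rendering parameters. The exact gain and delay formulas are assumptions chosen to match the qualitative rules above (lower blobs sound later, larger blobs sound louder, and left-skewed blobs sound louder in the left channel).

```python
# Sketch of the location and size encodings of Sections VI-B and VI-C: the
# left/right channel gains, onset delay and volume formulas are assumptions
# chosen to match the qualitative rules in the text.
def blob_to_render_params(blob, image_width, image_height,
                          swipe_duration=3.0, max_volume=1.0):
    """Return (delay_s, left_gain, right_gain) for one blob.

    blob: dict with "x", "y", "width" and "area" in pixel units.
    """
    mid = image_width / 2
    width = max(blob["width"], 1)

    # Y-axis: lower blobs sound later within the swipe.
    delay = (blob["y"] / image_height) * swipe_duration

    # Size: volume grows with blob area, split between the two regions so a
    # blob skewed to the left sounds louder in the left channel.  The per-side
    # area is approximated here by the width fraction on each side.
    left_width = max(0, min(blob["x"] + width, mid) - blob["x"])
    right_width = width - left_width
    scale = max_volume / (image_width * image_height)
    left_gain = blob["area"] * (left_width / width) * scale
    right_gain = blob["area"] * (right_width / width) * scale
    return delay, left_gain, right_gain
```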

VII. TRAINING & PRELIMINARY TEST

Several basic preliminary tests were conducted, not only to validate the software developed but also to verify the validity of our sensory-substitution approach. Due to the preliminary nature of these tests, a single subject with adequate hearing and a background in music was used. Before the actual testing phase, the subject was asked to go through a training phase, where he learnt to recognize soundscapes generated by the prototype.

A. Training

The training was limited to the basic mappings (or features) implemented by the prototype, namely: colour, location and size. The participant was trained to recognize the three basic features and was expected to recognize combinations of these in order to achieve the objective of each experiment. The duration of each training session depended on the satisfaction of the participant, who was allowed to repeat sessions any number of times until he was satisfied with his ability to recognize each basic feature.

The first training session was to recognize the colour of the input image through its corresponding sound (i.e. timbre). The participant was also trained to differentiate shades of colour through the resulting variations in pitch. To train the participant to recognize the location of an object, he was taught to relate time delays to vertical position and stereo placement to horizontal position. Four blobs of the same colour were drawn on four different input images in four different quadrants: top-left, top-right, bottom-left and bottom-right. The participant was required to listen to the resulting soundscape and interpret the position of the blobs. Training to recognize blob size was comparatively easy, because of the simple relationship between object size and sound volume. Blobs of different sizes were drawn on different input images, and the participant was asked to listen closely to the different sound volumes and to recognize blob sizes based on these differences.

B. Experiments

Six different experiments were conducted with the participant after the training phase. The difficulty and complexity of the experiments were low, with a maximum of two feature combinations per test. Due to the preliminary nature of the testing, the sample images consisted of simple synthetic objects. Figure 4 shows some of the test samples used in the experiments.

1) Colour Test: The colour test was the simplest experiment of all; the participant was required to differentiate between the 10 possible blob colours. One object with one colour was shown each time, and the participant was required to identify the colour through the resulting sound. The process was repeated 20 times to roughly gauge the accuracy of the participant's colour identification.

2) Object Test: The objective of this experiment was to test whether our prototype produced adequate auditory features for recognizing four different object classes. The four objects were randomly displayed in different sequences, 5 times each (each object class consisted of five variations). In total, the participant needed to recognize the objects in 20 trials. The four object classes (i.e. bee, house, stickman and tree) were pre-selected partly based on their distinctive features. The bee was selected due to its combination of black and yellow features. Images of the house category were drawn from a square and a triangle of different colours; the participant was expected to recognize the object through the triangle/square arrangement regardless of the specific colours. Stickman images were drawn in only one colour, with a round head at the top and a body made out of lines. Tree images were drawn with simplified green foliage and a brown trunk.

3) Shade Test: As mentioned earlier, different colour shades produce different pitches for the same musical instrument. In the shade test, two blobs of the same colour but with different shades were drawn side by side, and the participant had to pick the darker blob based on pitch. This experiment tested the participant's ability to discriminate horizontal locations and shades of colour.

4) Find Location: In this experiment, one variant from the four object classes (mentioned in the object test) was selected and redrawn in four different image quadrants. The participant was asked to locate the position of a specific object in one of the four quadrants. 24 different combinations of images were shown randomly to the participant.

5) Find Object: This experiment is closely related to the previous one. In this case, the participant was given a specific quadrant and asked to identify the object located in it. 24 different combinations of images were shown randomly to the participant.

6) Counting Test: The final experiment was the most complex of the six reported here; the participant (and the prototype) were tested on all three visual features (i.e. location, size and colour). In Figure 4, the bottom-left image labelled "Rounds" is one of the test images, where the user needs to count the number of round objects in three different colours. Different shapes, with different sizes and colours, were drawn on each image. Every image had a different number of blobs located at random positions. The participant was required to count the number of blobs in the image. The process was repeated 10 times.

Fig. 4. Test Samples

VIII. RESULTS

Fig. 5. Preliminary Experiments Accuracy

The bar chart in Figure 5 shows the recognition rates of the participant for the different experimental conditions. Given that for most conditions the chance level is 25% (for the colour and shade tests the chance levels are 10% and 50% respectively), these results suggest that the prototype is indeed functioning as expected. The participant obtained the lowest accuracy in the experiment that required him to identify an object within a given quadrant. This result might be partly explained by the phenomenon of cacophony, which was more pronounced in this condition. In contrast, and perhaps ironically, the participant obtained the highest accuracy in the experiment that required him to find the location of a given object. The fact that the subject limited his focus to one quadrant at a time and compared its soundscape to the one stored in his memory might explain the higher accuracy. In the counting test, the participant obtained a high accuracy rate, with 90% correct answers. This suggests that swiping from top to bottom is a good approach for helping users create mental maps of blob locations and numbers. The participant was capable of drawing the location of blobs on a

graph, while counting them one by one.

IX. CONCLUSION AND FUTURE ENHANCEMENTS

The swiping method aims to improve current visual-to-auditory sensory substitution solutions by maximizing information conversion whilst maintaining learnability and interpretability. The visual properties of colour, size and location are explicitly encoded and converted into auditory signals. Due to the attentional (banding) and temporal (swiping) aspects of the solution, information about shape and texture can also be deduced from the soundscape. Through training and/or repeated use, users should become capable of interpreting increasing amounts of information with a reduced sense of cacophony, and should exhibit faster recognition and learning rates.

Having said this, a significant amount of work remains to be done. One of the most immediate future tasks involves the image processing stage, whereby we hope to generate simplified visual descriptions (through modified segmentation algorithms) that cohesively extract the most relevant aspects of a visual scene. Future work will also involve the completion of a highly configurable prototype which, apart from allowing users to fine-tune the system to their tastes and requirements, will allow us to conduct extensive experimentation in order to conclusively answer the question of how to maximize cross-modal information conversion. This effort will also involve an investigation of human perceptual capabilities and limitations. Quantitative conversion measures also need to be developed in order to facilitate preliminary comparisons between approaches. Different variations of the solution are expected to be developed for different contexts, e.g. navigation (walking in a shopping mall) versus human-computer interaction (interpreting a graph). Due to the importance of depth information, particularly in the context of navigation, future versions of the approach should incorporate a stereo camera. The auditory encoding of depth should be distinctive and should not interfere with the mappings already provided for colour, position and size.

In conclusion, preliminary results indicate that the image processing and attentional dynamics adopted by Luminophonics are valid. This suggests that the project's aims, to further maximize the quantity, rate, learnability and interpretability of the information converted, are within reach, and consequently that its ultimate goal, to provide effective technology that assists the visually impaired in their real-life interactions with the environment, is also attainable.

REFERENCES
[1] Ben-Tal, O., Berger, J., Cook, B., Daniels, M., Scavone, G., and Cook, P. SonART: The sonification application research toolbox. In Proceedings of the 2002 International Conference on Auditory Display, page 151, Kyoto, Japan, July 2002.

[2] Bologna, G., Deville, B., Pun, T., and Vinckenbosch, M. Transforming 3D coloured pixels into musical instrument notes for vision substitution applications. EURASIP Journal on Image and Video Processing, 2007.

[3] Chang, F., Chen, C.-J., and Lu, C.-J. A linear-time component-labeling algorithm using contour tracing technique. Computer Vision and Image Understanding, 93(2):206–220, February 2004.

[4] Grant, A. C., Thiagarajah, M. C., and Sathian, K. Tactile perception in blind Braille readers: A psychophysical study of acuity and hyperacuity using gratings and dot patterns. Perception & Psychophysics, pages 301–312, 2000.

[5] Meijer, P. B. L. An experimental system for auditory image representations. IEEE Transactions on Biomedical Engineering, 39(2):112–121, 1992.

[6] World Health Organization. Up to 45 million blind people globally - and growing. Retrieved from the World Health Organization official website: http://www.who.int/mediacentre/news/releases/2003/pr73/en/, October 2003.

[7] Yeo, W. S. and Berger, J. Application of raster scanning method to image sonification, sound visualization, sound analysis and synthesis. In Proceedings of the 9th International Conference on Digital Audio Effects (DAFx-06), 2006.
