
IEEE Std 3333.1.2™-2017

IEEE Standard for the
Perceptual Quality Assessment
of Three-Dimensional (3D) and
Ultra-High-Definition (UHD) Contents

IEEE Computer Society
Sponsored by the Standards Activities Board

IEEE
3 Park Avenue
New York, NY 10016-5997
USA

Sponsor
Standards Activities Board
of the
IEEE Computer Society

Approved 6 December 2017
IEEE-SA Standards Board

Abstract: The world is witnessing rapid advances in stereoscopic 3D (S3D) and ultra-high-definition (UHD) technology. As a result, accurate quality and visual-comfort assessment techniques are needed to foster the display-device industry as well as the signal-processing field. In this standard, thorough assessments with respect to the human visual system (HVS) for S3D and UHD contents are presented. Moreover, several image and video databases are publicly provided for any research purpose.

Keywords: accommodation and vergence conflict, foveation, human visual system, HVS, IEEE
3333.1.2™, quality assessment, QoE, quality of experience, saliency detection, stereoscopic,
stereoscopic display, subjective assessment, UHD, UHD display, ultra-high definition, visual
contents analysis, visual comfort, visual discomfort

The Institute of Electrical and Electronics Engineers, Inc.


3 Park Avenue, New York, NY 10016-5997, USA

Copyright © 2018 by The Institute of Electrical and Electronics Engineers, Inc.


All rights reserved. Published 28 June 2018. Printed in the United States of America.

IEEE is a registered trademark in the U.S. Patent & Trademark Office, owned by The Institute of Electrical and Electronics Engineers,
Incorporated.

PDF: ISBN 978-1-5044-4659-4 STD22981


Print: ISBN 978-1-5044-4660-0 STDPD22981

IEEE prohibits discrimination, harassment, and bullying.


For more information, visit http://www.ieee.org/web/aboutus/whatis/policies/p9-26.html.
No part of this publication may be reproduced in any form, in an electronic retrieval system or otherwise, without the prior written permission
of the publisher.

Important Notices and Disclaimers Concerning IEEE Standards Documents
IEEE documents are made available for use subject to important notices and legal disclaimers. These
notices and disclaimers, or a reference to this page, appear in all standards and may be found under the
heading “Important Notices and Disclaimers Concerning IEEE Standards Documents.” They can also be
obtained on request from IEEE or viewed at http://standards.ieee.org/IPR/disclaimers.html.

Notice and Disclaimer of Liability Concerning the Use of IEEE Standards Documents

IEEE Standards documents (standards, recommended practices, and guides), both full-use and trial-use, are
developed within IEEE Societies and the Standards Coordinating Committees of the IEEE Standards
Association (“IEEE-SA”) Standards Board. IEEE (“the Institute”) develops its standards through a
consensus development process, approved by the American National Standards Institute (“ANSI”), which
brings together volunteers representing varied viewpoints and interests to achieve the final product. IEEE
Standards are documents developed through scientific, academic, and industry-based technical working
groups. Volunteers in IEEE working groups are not necessarily members of the Institute and participate
without compensation from IEEE. While IEEE administers the process and establishes rules to promote
fairness in the consensus development process, IEEE does not independently evaluate, test, or verify the
accuracy of any of the information or the soundness of any judgments contained in its standards.

IEEE Standards do not guarantee or ensure safety, security, health, or environmental protection, or ensure
against interference with or from other devices or networks. Implementers and users of IEEE Standards
documents are responsible for determining and complying with all appropriate safety, security,
environmental, health, and interference protection practices and all applicable laws and regulations.

IEEE does not warrant or represent the accuracy or content of the material contained in its standards, and
expressly disclaims all warranties (express, implied and statutory) not included in this or any other
document relating to the standard, including, but not limited to, the warranties of: merchantability; fitness
for a particular purpose; non-infringement; and quality, accuracy, effectiveness, currency, or completeness
of material. In addition, IEEE disclaims any and all conditions relating to: results; and workmanlike effort.
IEEE standards documents are supplied “AS IS” and “WITH ALL FAULTS.”

Use of an IEEE standard is wholly voluntary. The existence of an IEEE standard does not imply that there
are no other ways to produce, test, measure, purchase, market, or provide other goods and services related
to the scope of the IEEE standard. Furthermore, the viewpoint expressed at the time a standard is approved
and issued is subject to change brought about through developments in the state of the art and comments
received from users of the standard.

In publishing and making its standards available, IEEE is not suggesting or rendering professional or other
services for, or on behalf of, any person or entity nor is IEEE undertaking to perform any duty owed by any
other person or entity to another. Any person utilizing any IEEE Standards document, should rely upon his
or her own independent judgment in the exercise of reasonable care in any given circumstances or, as
appropriate, seek the advice of a competent professional in determining the appropriateness of a given
IEEE standard.

IN NO EVENT SHALL IEEE BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO:
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
OTHERWISE) ARISING IN ANY WAY OUT OF THE PUBLICATION, USE OF, OR RELIANCE
UPON ANY STANDARD, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE AND
REGARDLESS OF WHETHER SUCH DAMAGE WAS FORESEEABLE.

Translations

The IEEE consensus development process involves the review of documents in English only. In the event
that an IEEE standard is translated, only the English version published by IEEE should be considered the
approved IEEE standard.

Official statements

A statement, written or oral, that is not processed in accordance with the IEEE-SA Standards Board
Operations Manual shall not be considered or inferred to be the official position of IEEE or any of its
committees and shall not be considered to be, or be relied upon as, a formal position of IEEE. At lectures,
symposia, seminars, or educational courses, an individual presenting information on IEEE standards shall
make it clear that his or her views should be considered the personal views of that individual rather than the
formal position of IEEE.

Comments on standards

Comments for revision of IEEE Standards documents are welcome from any interested party, regardless of
membership affiliation with IEEE. However, IEEE does not provide consulting information or advice
pertaining to IEEE Standards documents. Suggestions for changes in documents should be in the form of a
proposed change of text, together with appropriate supporting comments. Since IEEE standards represent a
consensus of concerned interests, it is important that any responses to comments and questions also receive
the concurrence of a balance of interests. For this reason, IEEE and the members of its societies and
Standards Coordinating Committees are not able to provide an instant response to comments or questions
except in those cases where the matter has previously been addressed. For the same reason, IEEE does not
respond to interpretation requests. Any person who would like to participate in revisions to an IEEE
standard is welcome to join the relevant IEEE working group.

Comments on standards should be submitted to the following address:

Secretary, IEEE-SA Standards Board


445 Hoes Lane
Piscataway, NJ 08854 USA

Laws and regulations

Users of IEEE Standards documents should consult all applicable laws and regulations. Compliance with
the provisions of any IEEE Standards document does not imply compliance to any applicable regulatory
requirements. Implementers of the standard are responsible for observing or referring to the applicable
regulatory requirements. IEEE does not, by the publication of its standards, intend to urge action that is not
in compliance with applicable laws, and these documents may not be construed as doing so.

Copyrights

IEEE draft and approved standards are copyrighted by IEEE under U.S. and international copyright laws.
They are made available by IEEE and are adopted for a wide variety of both public and private uses. These
include both use, by reference, in laws and regulations, and use in private self-regulation, standardization,
and the promotion of engineering practices and methods. By making these documents available for use and
adoption by public authorities and private users, IEEE does not waive any rights in copyright to the
documents.

Photocopies

Subject to payment of the appropriate fee, IEEE will grant users a limited, non-exclusive license to
photocopy portions of any individual standard for company or organizational internal use or individual,
non-commercial use only. To arrange for payment of licensing fees, please contact Copyright Clearance
Center, Customer Service, 222 Rosewood Drive, Danvers, MA 01923 USA; +1 978 750 8400. Permission
to photocopy portions of any individual standard for educational classroom use can also be obtained
through the Copyright Clearance Center.

Updating of IEEE Standards documents

Users of IEEE Standards documents should be aware that these documents may be superseded at any time
by the issuance of new editions or may be amended from time to time through the issuance of amendments,
corrigenda, or errata. A current IEEE document at any point in time consists of the current edition of the
document together with any amendments, corrigenda, or errata then in effect.

Every IEEE standard is subjected to review at least every ten years. When a document is more than ten
years old and has not undergone a revision process, it is reasonable to conclude that its contents, although
still of some value, do not wholly reflect the present state of the art. Users are cautioned to check to
determine that they have the latest edition of any IEEE standard.

In order to determine whether a given document is the current edition and whether it has been amended
through the issuance of amendments, corrigenda, or errata, visit the IEEE Xplore at
http://ieeexplore.ieee.org/ or contact IEEE at the address listed previously. For more information about the
IEEE-SA or IEEE’s standards development process, visit the IEEE-SA Website at http://standards.ieee.org.

Errata

Errata, if any, for all IEEE standards can be accessed on the IEEE-SA Website at the following URL:
http://standards.ieee.org/findstds/errata/index.html. Users are encouraged to check this URL for errata
periodically.

Patents

Attention is called to the possibility that implementation of this standard may require use of subject matter
covered by patent rights. By publication of this standard, no position is taken by the IEEE with respect to
the existence or validity of any patent rights in connection therewith. If a patent holder or patent applicant
has filed a statement of assurance via an Accepted Letter of Assurance, then the statement is listed on the
IEEE-SA Website at http://standards.ieee.org/about/sasb/patcom/patents.html. Letters of Assurance may
indicate whether the Submitter is willing or unwilling to grant licenses under patent rights without
compensation or under reasonable rates, with reasonable terms and conditions that are demonstrably free of
any unfair discrimination to applicants desiring to obtain such licenses.

Essential Patent Claims may exist for which a Letter of Assurance has not been received. The IEEE is not
responsible for identifying Essential Patent Claims for which a license may be required, for conducting
inquiries into the legal validity or scope of Patents Claims, or determining whether any licensing terms or
conditions provided in connection with submission of a Letter of Assurance, if any, or in any licensing
agreements are reasonable or non-discriminatory. Users of this standard are expressly advised that
determination of the validity of any patent rights, and the risk of infringement of such rights, is entirely
their own responsibility. Further information may be obtained from the IEEE Standards Association.

Participants
At the time this standard was completed, the Human Factors for Visual Experiences Working Group had
the following membership:

Sanghoon Lee, Chair

Sewoong Ahn Hezerul Abdul Karim Maria G. Martini


Federica Battisti Haksub Kim Anh-Duc Nguyen
Marco Carli Jinwoo Kim Heeseok Oh
Sungho Cho Jongyoo Kim An Ping
Ricky Christanto SeongYong Kim Liquan Shen
SangKwon Jeong Woojae Kim Min Zhang
Xin Jin Patrick Le Callet Yu Zhang

The following members of the individual balloting committee voted on this standard. Balloters may have
voted for approval, disapproval, or abstention.

Iwan Adhicandra Juan Carreon Piotr Karocki


Ping An Randall Groves SeongYong Kim
Margaret Belska Werner Hoelzl Sanghoon Lee
Demetrio Bucaneg Jr. Noriyuki Ikeuchi R. K. Rannow
SangKwon Jeong

When the IEEE-SA Standards Board approved this standard on 6 December 2017, it had the following
membership:

Jean-Philippe Faure, Chair


Gary Hoffman, Vice Chair
John D. Kulick, Past Chair
Konstantinos Karachalios, Secretary

Chuck Adams Thomas Koshy Robby Robson


Masayuki Ariyoshi Joseph L. Koepfinger* Dorothy Stanley
Ted Burse Kevin Lu Adrian Stephens
Stephen Dukes Daleep Mohla Mehmet Ulema
Doug Edwards Damir Novosel Phil Wennblom
J. Travis Griffith Ronald C. Petersen Howard Wolfman
Michael Janezic Annette D. Reilly Yu Yuan

*Member Emeritus

Introduction

This introduction is not part of IEEE Std 3333.1.2-2017, IEEE Standard for the Perceptual Quality Assessment of
Three-Dimensional (3D) and Ultra-High-Definition (UHD) Contents.

This standard is specifically written for high-end 3D and UHD service purveyors, 3D and UHD display
makers, and the 3D digital cinema industry.

History

On 18 February 2011, a project was launched to develop IEEE P3333.1™, Quality Assessment of Three-Dimensional (3D) Contents Based on Psychophysical Studies, under a working group (WG) later renamed the Human Factors for Visual Experiences Working Group, and a PAR was submitted. The IEEE Standards Association Corporate Advisory Group approved it on 8 March 2011, and the IEEE Standards Association Standards Board approved it by 31 March 2011. On 5 April 2011 the project IEEE P3333.1 was officially approved. Finally, the WG called for participation on 27 May 2011. On 29 March 2014, IEEE P3333.1 split into two groups (IEEE P3333.1.1™ and IEEE P3333.1.2™).

IEEE P3333.1.1, IEEE Draft Standard for the Quality of Experience (QoE) and Visual-Comfort
Assessments of Three-Dimensional (3D) Contents Based on Psychophysical Studies: the PAR’s request
date was 6 November 2013, and its approval date was 26 March 2014. This standard establishes methods of
visual discomfort and quality-of-experience (QoE) assessments of 3D contents based on psychophysical
studies. These key factors are constructed in conjunction with the visual factors used to provide visual
discomfort and QoE degradation. On 10 July 2015 the standard was officially published.

IEEE P3333.1.2, IEEE Draft Standard for the Perceptual Quality Assessment of Three-Dimensional (3D)
and Ultra-High-Definition (UHD) Contents: the PAR’s request date was 6 November 2013, and its
approval date was 27 March 2014. Originally, the title was IEEE Draft Standard for the Perceptual Quality
Assessment of Three Dimensional (3D) Contents Based on Physiological Mechanisms. However, on
4 March 2016 after being re-approved, the official name of the PAR changed to the current version. This
standard establishes methods of quality assessment of 3D and UHD contents based on human visual system
analysis, such as perceptual quality and visual attention.

Contents

1. Scope
2. Normative references
3. Definitions, acronyms, and abbreviations
   3.1 Definitions
   3.2 Acronyms and abbreviations
4. An overview of the standard
   4.1 Quality assessment for 3D contents
   4.2 Quality assessment for ultra-high-definition contents
5. Quality assessment for 3D contents
   5.1 General
   5.2 Full-reference quality assessment of stereoscopic images using disparity-gradient-phase similarity
   5.3 Reduced reference real-time quality assessment of stereoscopic images/video
   5.4 Analysis of human 3D perception in stereo images
   5.5 Spatial quality pooling based on human fine 3D perception
   5.6 Objective image-quality assessment of 3D synthesized views
   5.7 A 3D subjective quality-prediction model based on depth distortion
6. Quality-of-experience assessment for ultra-high-definition contents based on the human visual system
   6.1 General
   6.2 Blind sharpness prediction for ultra-high-definition videos based on the human visual system
   6.3 Visual preference assessment on ultra-high-definition (UHD) images
7. The impact of compression and packet losses on 3D perception
   7.1 General
   7.2 Description of the proposed 3D video database
   7.3 Subjective experiments
8. A database of stereoscopic images plus depth
   8.1 Description of the proposed stereoscopic image database plus depth
   8.2 Subjective experiment
Annex A (informative) Bibliography

IEEE Standard for the
Perceptual Quality Assessment
of Three-Dimensional (3D) and
Ultra-High-Definition (UHD) Contents

1. Scope
This standard establishes methods for quality assessment of 3D and UHD contents based on physiological mechanisms such as perceptual quality and visual attention. This standard identifies and quantifies the following causes of perceptual quality degradation, and the associated visual attention, for 3D and UHD image and video contents:

- Compression distortion, such as multi-view image and video compression
- Interpolation distortion by intermediate-view rendering, such as 3D and UHD warping and view synthesis
- Structural distortion, such as bit errors caused by wireless/wired transmission
- Visual attention according to the quality degradation

Key items are needed to characterize the 3D and UHD database in terms of the human visual system. These
key factors are constructed in conjunction with the visual factors used to perceive quality and visual
attention.

2. Normative references
The following referenced documents are indispensable for the application of this document (i.e., they must
be understood and used, so each referenced document is cited in text and its relationship to this document is
explained). For dated references, only the edition cited applies. For undated references, the latest edition of
the referenced document (including any amendments or corrigenda) applies.

Recommendation ITU-T P.910, Subjective Video Quality Assessment Methods for Multimedia Applications. 1

Recommendation ITU-R BT.500-12, Methodology for the Subjective Assessment of the Quality of
Television Pictures.

1 ITU-T publications are available from the International Telecommunications Union (http://www.itu.int/).

Recommendation ITU-R BT.500-13, Methodology for the Subjective Assessment of the Quality of
Television Pictures.

Recommendation ITU-R BT.2021, Subjective Methods for the Assessment of Stereoscopic 3DTV Systems.

3. Definitions, acronyms, and abbreviations

3.1 Definitions

For the purposes of this document, the following terms and definitions apply. The IEEE Standards
Dictionary Online should be consulted for terms not defined in this clause. 2

accommodation: The process by which the vertebrate eye changes optical power to maintain a clear image
or focus on an object as its distance varies.

disparity: The difference in image location of an object seen by the left and right eyes, resulting from the eyes' horizontal separation (parallax). A disparity map expresses this difference in numbers of pixels.

cyclopean image: The image which is fused and perceived by the human visual system.

fovea: Part of the eye located in the center of the macular area of the retina.

human visual system: The part of the central nervous system which gives organisms the ability to process
visual detail, as well as enabling the formation of several non-image photo-response functions.

mean opinion score: In a subjective quality assessment, the average of all subjects’ rating scores of quality
of experience.

natural scene statistics: The statistical regularities exhibited by natural scenes.

photoreceptor: A specialized type of cell found in the retina that is capable of visual phototransduction.
The great biological importance of photoreceptors is that they convert light (visible electromagnetic
radiation) into signals that can stimulate biological processes.

quality assessment: Evaluation of quality of the service or product to determine the performance in
relation to set standards.

quality of experience: The degree of delight or annoyance of the user with an application or service. It
results from the fulfillment of his or her expectations with respect to the utility and/or enjoyment of the
application or service in the light of the user’s personality and current state.

retina: The sensitive layer of tissue at the back of the eye where sensory images are received.

saliency: The state or quality by which a region of an image or video stands out relative to other areas and thereby attracts visual attention.

segmentation: The act of segmenting an image into different meaningful regions to understand its contents.

sharpness: The degree to which the boundaries between zones of different tones or colors are clearly defined.

2 IEEE Standards Dictionary Online subscription is available at: http://dictionary.ieee.org.

stereoscopic display: Display device capable of inducing depth perception in viewers by means of
stereopsis for binocular parallax.

subjective assessment: An assessment of quality or visual discomfort for which there is no pre-established objective measure or standard and which is thus based solely on the opinion of the evaluator or a group of observers.

vergence: Simultaneous movements of both eyes in opposite directions to obtain or maintain single
binocular vision.

viewing distance: Distance between the viewer and display.

visual (dis)comfort: The state of mind that expresses (dis)satisfaction with the visual environment. In other
words, the visual (dis)comfort is a subjective sensation which accompanies physiological changes, in this
case the ease (stress) of viewing 3D content. As a subjective sensation, it can be measured by asking the
viewer to report its level. The positive (negative) sensation usually increases (decreases) rapidly when the
human observer gazes at a comfortable scenario or closes his eyes.

3.2 Acronyms and abbreviations

2D two-dimensional
3D three-dimensional
3DSwIM 3D synthesized view image-quality metric
CTU coding tree unit
DFT discrete Fourier transform
DGP-SIM disparity-gradient-phase-congruency similarity
DIBR depth-image-based rendering
DMOS differential mean opinion score
DSCQS double stimulus continuous quality scale
FR full reference
HD high definition
HEVC high-efficiency video coding
HVS human visual system
IQA image-quality assessment
LPF low-pass filter
MOS mean opinion score
MSSIM mean structural similarity
NR no reference
NSS natural scene statistics
PLCC Pearson linear correlation coefficient
PLR packet loss rate

PSNR peak signal-to-noise ratio


QA quality assessment
QoE quality of experience
QP quantization parameter
RGB red, green, and blue channels
RMSE root mean squared error
RR reduced reference
RTP real-time transport protocol
SIQA stereoscopic image-quality assessment
SROCC Spearman’s rank correlation coefficient
SSIM structural similarity
SVR support vector regression
UHD ultra-high definition

4. An overview of the standard

4.1 Quality assessment for 3D contents

The separate perception of the left and right views of stereoscopic 3D stimuli induces various binocular perception processes in the human visual system, so the visual quality of 3D stimuli is determined quite differently from that of 2D stimuli. Indeed, existing perceptual image-quality-assessment models designed for 2D images generally fail to accurately predict the quality scores of 3D images. To analyze and predict the perceptual quality of 3D content, it is necessary to understand the content in terms of spatial and temporal characteristics, based on existing psychophysical and statistical models of binocular visual perception. In addition, a perceptual quality-prediction model of depth-image-based rendering (DIBR)–synthesized views is suggested. Toward this goal, specific artifacts related to DIBR systems need to be considered.

Mostly, subjective assessment of 3D content is inherited from what has traditionally been done for 2D subjective assessment as defined by ITU-T P.910, ITU-R BT.500-13, and ITU-R BT.2021. 3 However, it is doubtful whether those results are reliable enough to serve as a reference, because the 3D viewing environment differs considerably from 2D: the user is intensively immersed while wearing glasses in dark lighting. Hence, to perform 3D image and video assessments, new databases are required that cover the characteristics of human perception, the display mechanism, the viewing environment, and so on.

4.2 Quality assessment for ultra-high-definition contents

Generally, UHD systems have much higher resolution, as well as a much larger field of view, than HD systems. As a consequence, viewers are expected to experience more realistic content on UHD displays. This fosters the study of quality-of-experience issues in UHD systems. In particular, sharpness and preference assessments stand out because one of the most critical factors discriminating the UHD and HD experiences is the pixel density perceived by viewers.

3 Information on references can be found in Clause 2.

Many post-processing techniques have been studied in order to satisfy viewers' expectations. Classically, it is believed that users enjoy visual content more when it is sharp and has high contrast. However, when the sharpness and/or contrast is too high, the content becomes unnatural and difficult to perceive. As a result, the search for an appropriate sharpness- and contrast-enhancement level is crucial and practical.

5. Quality assessment for 3D contents

5.1 General

Interest in stereoscopic image quality has been increasing lately, stimulated by both the entertainment industry and academic research (Lee and Lee [B20]). 4 In practice, stereoscopic images can be degraded by common processes such as compression or transmission. As in the monocular case, there are three types of image-quality assessment: full-reference, reduced-reference, and no-reference quality assessment. In this standard, all of these methods are explored in succession.

5.2 Full-reference quality assessment of stereoscopic images using disparity-gradient-phase similarity

5.2.1 General

The flowchart of the 3D image-quality assessment method is shown in Figure 1. First, reference and distorted cyclopean images are generated from the reference and distorted stereoscopic images, respectively, based on a binocular combination/fusion model. Then, the disparity maps of the reference and distorted stereo pairs, and the gradient-magnitude and phase-congruency maps of the reference and distorted cyclopean images, are extracted. Similarity measurements of the three features are computed and then combined into the final stereoscopic image-quality index.

Figure 1 —Disparity-gradient-phase-congruency similarity (DGP-SIM) framework for stereoscopic image-quality assessment (SIQA)

4 Numbers in brackets correspond to those of the bibliography in Annex A.

5.2.2 Cyclopean image

When human eyes view stereoscopic images, the left and right eyes receive the left and right views of the stereo pair, respectively. Before the human visual system (HVS) processes the visual signal, the stereoscopic images are fused into a single image, i.e., the cyclopean image, which can be explained by the binocular combination characteristics of vision. Here, the latest biological model, the gain-control theory model, is used to simulate the binocular fusion and explain cyclopean perception. The simulated cyclopean image $CI(x, y)$ is computed as follows:

$CI(x, y) = W_L\, I_L(x, y) + W_R\, I_R(x, y)$    (1)

where

$W_L$ is the weighting parameter of the left image $I_L(x, y)$
$W_R$ is the weighting parameter of the right image $I_R(x, y)$

Since the Gabor filter achieves high performance in modeling the visual signal–processing mechanism of simple cells in the primary visual cortex of the HVS, the normalized left and right Gabor-filter energy responses are used to weight the stereo pair. Considering binocular disparity, the simulated cyclopean image $CI(x, y)$ can be expressed as follows:

$CI(x, y) = W_L(x, y)\, I_L(x, y) + W_R(x, y)\, I_R(x - d_L(x, y), y)$    (2)

$W_L(x, y) = \dfrac{GE_L(x, y)}{GE_L(x, y) + GE_R(x - d_L(x, y), y)}$    (3)

$W_R(x, y) = \dfrac{GE_R(x - d_L(x, y), y)}{GE_L(x, y) + GE_R(x - d_L(x, y), y)}$    (4)

where

$GE_L$ is the overall Gabor response magnitude of the left image, over all scales and all orientations
$GE_R$ is the overall Gabor response magnitude of the right image, over all scales and all orientations
$d_L(x, y)$ is the disparity map

Figure 2 shows an example of a synthesized cyclopean image. The distorted stereoscopic images are asymmetrically distorted by white noise, as shown in the right-view image. Since the white-noise distortion can increase the right stimulus strength, the overall cyclopean image is dominated by the right view with white noise.

Figure 2 —Cyclopean image synthesized from a distorted stereo pair; (a) and (b) are the left and right views of the stereo pair, respectively, and (c) is the cyclopean image of (a) and (b)
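As an informative illustration of Equation (2) through Equation (4), the following Python sketch assembles a cyclopean image from a stereo pair. The Gabor frequencies and orientations, the use of skimage.filters.gabor, and the integer nearest-pixel disparity warp are illustrative assumptions, not values or tools prescribed by this standard.

```python
import numpy as np
from skimage.filters import gabor

def gabor_energy(img, frequencies=(0.1, 0.2, 0.3), orientations=4):
    """Overall Gabor response magnitude summed over scales and orientations (GE_L, GE_R)."""
    energy = np.zeros_like(img, dtype=np.float64)
    for f in frequencies:
        for k in range(orientations):
            real, imag = gabor(img, frequency=f, theta=k * np.pi / orientations)
            energy += np.hypot(real, imag)
    return energy

def cyclopean_image(i_left, i_right, d_left):
    """Gain-control cyclopean synthesis of Equation (2); d_left is an integer
    left-to-right disparity map in pixels (an assumed representation)."""
    h, w = i_left.shape
    ys, xs = np.mgrid[0:h, 0:w]
    x_r = np.clip(xs - d_left.astype(int), 0, w - 1)   # x - d_L(x, y)
    i_right_warp = i_right[ys, x_r]                     # I_R(x - d_L, y)
    ge_l = gabor_energy(i_left)                         # GE_L(x, y)
    ge_r = gabor_energy(i_right)[ys, x_r]               # GE_R(x - d_L, y)
    w_l = ge_l / (ge_l + ge_r + 1e-12)                  # Equation (3)
    w_r = 1.0 - w_l                                     # Equation (4)
    return w_l * i_left + w_r * i_right_warp            # Equation (2)
```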

5.2.3 Stereoscopic image-quality metric

The DGP-SIM metric has two stages. In the first stage, three local feature-similarity maps are computed; in the second stage, the three similarity maps are combined into a single similarity index.

First, the similarity measure for the disparity maps, $S_{DP}(x)$, of the reference and distorted stereo pairs is defined as follows:

$S_{DP}(x) = \dfrac{2\, DP_1(x)\, DP_2(x) + C_1}{DP_1^2(x) + DP_2^2(x) + C_1}$    (5)

where

$DP_1(x)$ is the disparity map of the reference stereo pair
$DP_2(x)$ is the disparity map of the distorted stereo pair
$C_1$ is 11.25, a positive constant to increase the stability of $S_{DP}$ when the denominator is close to zero

The value of $S_{DP}$ is between 0 and 1.

Identically, the similarity measure for the gradient-magnitude maps, $S_{GM}(x)$, of the reference and distorted cyclopean images is defined as follows:

$S_{GM}(x) = \dfrac{2\, GM_1(x)\, GM_2(x) + C_2}{GM_1^2(x) + GM_2^2(x) + C_2}$    (6)

where

$GM_1(x)$ represents the gradient-magnitude map of the reference cyclopean image
$GM_2(x)$ represents the gradient-magnitude map of the distorted cyclopean image
$C_2$ is 12.25

The similarity measure for the phase-congruency maps, $S_{PC}(x)$, of the reference and distorted cyclopean images is defined as follows:

$S_{PC}(x) = \dfrac{2\, PC_1(x)\, PC_2(x) + C_3}{PC_1^2(x) + PC_2^2(x) + C_3}$    (7)

where

$PC_1(x)$ is the phase-congruency map of the reference cyclopean image
$PC_2(x)$ is the phase-congruency map of the distorted cyclopean image
$C_3$ is 0.0085

Then, $S_{DP}(x)$, $S_{GM}(x)$, and $S_{PC}(x)$ are combined to obtain the final similarity index $S_{SIQA}(x)$, which is defined as follows:

$S_{SIQA}(x) = [S_{DP}(x)]^{\alpha}\, [S_{GM}(x)]^{\beta}\, [S_{PC}(x)]^{\gamma}$    (8)

where

$\alpha$ is the weight of $S_{DP}$
$\beta$ is the weight of $S_{GM}$
$\gamma$ is the weight of $S_{PC}$

When these three feature measurements are separately used as SIQA metrics, the gradient-magnitude and phase-congruency measurements achieve relatively high performance. Hence, larger weights are chosen for the gradient-magnitude and phase-congruency parameters: $\alpha = 0.2$, $\beta = 0.8$, and $\gamma = 0.6$.
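A minimal sketch of the two-stage combination follows, assuming the disparity, gradient-magnitude, and phase-congruency maps of the reference and distorted contents are already available as NumPy arrays of equal shape. Spatial mean pooling of the combined map into a single score is an assumption; the pooling step is left implicit above.

```python
import numpy as np

def feature_similarity(f_ref, f_dst, c):
    """Generic similarity map of Equation (5) through Equation (7)."""
    return (2.0 * f_ref * f_dst + c) / (f_ref ** 2 + f_dst ** 2 + c)

def dgp_sim(dp_ref, dp_dst, gm_ref, gm_dst, pc_ref, pc_dst,
            c1=11.25, c2=12.25, c3=0.0085, alpha=0.2, beta=0.8, gamma=0.6):
    """Combine the three similarity maps into the DGP-SIM index, Equation (8)."""
    s_dp = feature_similarity(dp_ref, dp_dst, c1)
    s_gm = feature_similarity(gm_ref, gm_dst, c2)
    s_pc = feature_similarity(pc_ref, pc_dst, c3)
    s_map = (s_dp ** alpha) * (s_gm ** beta) * (s_pc ** gamma)
    return float(s_map.mean())      # assumed spatial pooling into one score
```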

5.2.4 Performance analysis

The LIVE 3D Image Quality Database phase I and phase II were used to test the algorithm. Phase I consists of 20 reference images and 365 distorted images with five kinds of distortion, i.e., JP2K, JPEG, white noise, fast fading, and blur; all distortions are symmetric. The phase II database consists of eight reference images and 360 distorted images with the same five distortion types as phase I; every reference stereo pair is processed to create three symmetric and six asymmetric distorted stereo pairs.

Three criteria are used to evaluate SIQA metrics: the Pearson linear correlation coefficient (PLCC), Spearman's rank correlation coefficient (SROCC), and root mean squared error (RMSE). PLCC and RMSE evaluate the prediction accuracy of SIQA metrics, and SROCC evaluates prediction monotonicity. Both the SROCC and PLCC values of the metric are larger than 0.93, and its RMSE value is about 5.8.

5.3 Reduced reference real-time quality assessment of stereoscopic images/video

5.3.1 General

Measuring 3D video quality "on the fly" using full-reference (FR) quality metrics is not feasible because the original 3D video sequence is needed at the receiver side for comparison. Therefore, this subclause discusses a reduced-reference (RR) quality metric for left-and-right-view–based 3D video that uses edge information extracted from the binocular views.

5.3.2 Stereo image-quality evaluation

Luminance and contrast comparisons are carried out to quantify image-quality degradations. The luminance comparison accounts for overall illumination changes of the scene, while the variation of the illumination level is captured by the contrast comparison. The statistics related to the luminance and contrast of the original images are referred to as side-information. This side-information is sent to the receiver side, where the same statistics are calculated from the received image and compared with the corresponding side-information sent by the sender. The disparity of the images is calculated to quantify depth-perception/binocular-vision–related artifacts. The edge information generated from the original/processed left and right views is compared to quantify the structural degradation of the stereoscopic images. The edge-based structural degradation represents the disparity distortion of the 3D video content. If the structural degradation is high, meaning that the disparity distortion of the images is significant, users will struggle to fuse the views and perceive depth information, and the user experience will suffer as a result of increased binocular rivalry. The effectiveness of using edges to quantify the disparity of the images is demonstrated in Figure 3.

Figure 3 —Sample stereoscopic image pair and the extracted edge information:
(a) left view, (b) right view, (c) extracted binary edge map of the left view using Sobel
filtering, (d) extracted binary edge map of the right view using Sobel filtering, and
(e) combined edge map of left and right images

In order to quantify the structural/disparity, luminance, and contrast comparison parameters for stereo images, the commonly used structural similarity (SSIM) metric is adopted. Equation (9) through Equation (13) describe the calculation of the SSIM metric.

The SSIM index between signals x and y is as follows:

$SSIM(x, y) = [l(x, y)]^{\alpha}\, [c(x, y)]^{\beta}\, [s(x, y)]^{\gamma}$    (9)

where

$SSIM(x, y)$ represents the similarity map
$\alpha$ is greater than zero
$\beta$ is greater than zero
$\gamma$ is greater than zero
$l(x, y)$ represents the luminance comparison
$c(x, y)$ represents the contrast comparison
$s(x, y)$ represents the structural comparison

The $l(x, y)$, $c(x, y)$, and $s(x, y)$ components are given by Equation (10), Equation (11), and Equation (12), respectively.

$l(x, y) = \dfrac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}$    (10)

$c(x, y) = \dfrac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}$    (11)

$s(x, y) = \dfrac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}$    (12)

where

$x$ is a block vector (i.e., a local 8 × 8 square window) of image X
$y$ is a block vector (i.e., a local 8 × 8 square window) of image Y
$\mu_x$ is the mean of vector x
$\mu_y$ is the mean of vector y
$\sigma_x$ is the standard deviation of vector x
$\sigma_y$ is the standard deviation of vector y
$\sigma_{xy}$ is the covariance of vectors x and y
$C_1$ is a small constant to prevent the denominator from being zero
$C_2$ is a small constant to prevent the denominator from being zero
$C_3$ is a small constant to prevent the denominator from being zero

The mean and standard deviation are calculated per block (each block is 8 × 8). The luminance comparison
mainly provides a rating for the average intensity of the image (using the mean value), whereas the contrast
comparison provides an indication about the variation of the pixel intensity of the image frame (using the
standard deviation of individual images). The structural comparison provides an indication about the
structural degradation compared to the original image by calculating the covariance between the original
and processed images. The mean structural similarity (MSSIM) index for overall image-quality evaluation
is defined as follows:

$MSSIM(X, Y) = \dfrac{1}{M} \sum_{j=1}^{M} SSIM(x_j, y_j)$    (13)

where

X is the reference image
Y is the distorted image
$x_j$ is the content of the jth local window of image X
$y_j$ is the content of the jth local window of image Y
M is the number of local windows in the image

The MSSIM index provides a measurement of how similar the images X and Y are.

The luminance and contrast comparisons are calculated from the corresponding left and right images, as described in Equation (10) and Equation (11). However, the structural comparison is performed on the extracted edge information (i.e., the gradient image) of the left and right images rather than in the pixel domain. In the case of the left-image quality evaluation, the combined edge information of the original and processed left and right views is compared to obtain an index for the structural degradation of the left view using Equation (12), i.e., $s(x', y')$. The term $s(x', y')$ is common to both the left- and right-image quality evaluations. The proposed SSIM-based quality index for the left view, $Q_{\mathrm{left}}$, can then be described as follows:

$Q_{\mathrm{left}}(x, y) = [l(x, y)]^{\alpha}\, [c(x, y)]^{\beta}\, [s(x', y')]^{\gamma}$    (14)

where

$l(x, y)$ is the luminance comparison performed on the original and processed left images
$c(x, y)$ is the contrast comparison performed on the original and processed left images
$s(x', y')$ is the structural/disparity comparison between the combined gradient/edge maps of the original and processed left images

Then the overall left-view quality is calculated as follows:

$MQ_{\mathrm{left}}(X, Y) = \dfrac{1}{M} \sum_{j=1}^{M} Q_{\mathrm{left}}(x_j, y_j)$    (15)

where

$MQ_{\mathrm{left}}$ is the MSSIM quality rating for the left view with the proposed method

$MQ_{\mathrm{right}}$ is calculated according to the same principles as Equation (14) and Equation (15); the only difference is that the luminance and contrast comparisons are calculated using the original and processed right images. To obtain a single index for 3D video quality, $MQ_{\mathrm{3DVideo}}$, the left and right quality indexes are averaged:

$MQ_{\mathrm{3DVideo}}(X, Y) = \dfrac{MQ_{\mathrm{left}}(X, Y) + MQ_{\mathrm{right}}(X, Y)}{2}$    (16)

A simplified block diagram of the proposed method is shown in Figure 4.

Figure 4 —Block diagram of the proposed reduced-reference method
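The sketch below outlines Equation (14) through Equation (16) for grayscale views. The Sobel edge maps, the edge threshold, the 8 × 8 non-overlapping blocks, and the equal exponents (α = β = γ = 1) are illustrative assumptions; in the actual reduced-reference setting, the receiver would use only the transmitted side-information (block means and standard deviations plus the combined binary edge map), whereas this sketch keeps the full original views for brevity.

```python
import numpy as np
from scipy.ndimage import sobel

def edge_map(img, thresh=0.1):
    """Binary Sobel edge map used as side-information."""
    g = np.hypot(sobel(img, axis=0), sobel(img, axis=1))
    return (g > thresh * g.max()).astype(np.float64)

def _blocks(img, size=8):
    """Non-overlapping size x size blocks (a simplification of the local windows)."""
    h, w = img.shape
    for i in range(0, h - size + 1, size):
        for j in range(0, w - size + 1, size):
            yield img[i:i + size, j:j + size]

def q_view(orig, proc, e_orig, e_proc, c1=1e-4, c2=1e-4, c3=1e-4):
    """Per-view quality of Equations (14)-(15): luminance/contrast on pixels,
    structure on the combined edge maps; alpha = beta = gamma = 1 assumed."""
    scores = []
    for x, y, xe, ye in zip(_blocks(orig), _blocks(proc), _blocks(e_orig), _blocks(e_proc)):
        l = (2 * x.mean() * y.mean() + c1) / (x.mean() ** 2 + y.mean() ** 2 + c1)
        c = (2 * x.std() * y.std() + c2) / (x.std() ** 2 + y.std() ** 2 + c2)
        cov = ((xe - xe.mean()) * (ye - ye.mean())).mean()
        s = (cov + c3) / (xe.std() * ye.std() + c3)
        scores.append(l * c * s)
    return float(np.mean(scores))

def mq_3d_video(left_o, left_p, right_o, right_p):
    """Overall 3D index of Equation (16), using the combined edge map of both views."""
    e_o = np.maximum(edge_map(left_o), edge_map(right_o))
    e_p = np.maximum(edge_map(left_p), edge_map(right_p))
    return 0.5 * (q_view(left_o, left_p, e_o, e_p) + q_view(right_o, right_p, e_o, e_p))
```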

5.3.3 Overhead analysis

Since the extracted image features need to be transmitted over the channel, either in-band or on a dedicated
connection, the overhead associated with this side-information should be kept at a minimum level. In this
method, only the combined binary edge information of the original image (i.e., ones and zeros of the edge
map/1-bit per pixel) and statistics required to perform luminance and contrast comparisons (mean and
standard deviation) will be transmitted, and hence require a lower bitrate than the full-reference methods.

If a stereoscopic image pair with resolution 720 × 576 is represented with 24 bits per pixel (YUV 4:4:4), the FR method generates 2 × 24 × 720 × 576 = 19 906 560 bits (approximately 19.9 Mb) per stereo pair. Therefore, the FR metric can require up to 19.9 Mb per stereoscopic image pair. Since the method transmits only the combined edge map as side-information, it requires only 720 × 576 = 414 720 bits (approximately 415 kb) per stereo image pair. This is a significant bitrate saving compared to the FR method. However, the number of bits required for the binary combined edge map is still high. Therefore, in order to further reduce the overhead, the reference data (i.e., the binary edge mask) can be compressed, e.g., through run-length encoding or by considering techniques for binary-map encoding such as those used in MPEG-4. Since the binary edge mask of the depth map is composed of a high number of zero values, a high compression rate is achievable. Also, by assessing the quality of a video sequence through a reduced number of frames, the sending rate of the side-information is reduced (e.g., side-information can be sent every 25 frames). This approach can further reduce the bitrate requirement of the proposed method.
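The following sketch reproduces the overhead arithmetic above and illustrates a simple run-length encoding of the binary edge mask. Run-length coding is used purely as an example; this subclause does not prescribe a particular compression scheme.

```python
import numpy as np

def side_info_bits(width=720, height=576, bits_per_pixel=24):
    """Full-reference versus edge-map side-information overhead (5.3.3)."""
    fr_bits = 2 * bits_per_pixel * width * height   # both views, all pixels -> 19 906 560 bits
    rr_bits = width * height                        # combined 1-bit edge map -> 414 720 bits
    return fr_bits, rr_bits

def run_length_encode(edge_mask):
    """Run-length encode a flattened binary edge mask; because the mask is mostly
    zeros, the resulting (value, run-length) list is far smaller than the raw map."""
    flat = np.asarray(edge_mask, dtype=np.uint8).ravel()
    change = np.flatnonzero(np.diff(flat)) + 1
    starts = np.concatenate(([0], change))
    lengths = np.diff(np.concatenate((starts, [flat.size])))
    return list(zip(flat[starts].tolist(), lengths.tolist()))
```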

5.4 Analysis of human 3D perception in stereo images

5.4.1 General

To assess stereo images, it is essential to quantify how human perception operates in acquiring 3D visual information. This standard adopts a classification method for coarse 3D perception and proposes a methodology for measuring visual sensitivity over a 3D coordinate transform for fine 3D perception.

5.4.2 Stereo image segment classification

5.4.2.1 General

A stereo image is divided into binocular and monocular vision segments, where each segment is
constructed by merging adjacent regions with similar spatial coherence. To achieve this, a classification
algorithm composed of three steps—which are segmentation, classification, and clustering—is performed.
In each step there is an objective function in accordance with each of the above purposes.

5.4.2.2 Segmentation

The left and right images along with the disparity maps are segmented in terms of spatial correlation, which
provides spatially diverse information from three different views to achieve higher segmentation accuracy for
the stereo image. To control the size of each segment, a criterion is used to determine whether each segment
needs to be merged to a neighbor segment. The relative object size (ROS) is defined as the ratio of the number
of pixels of an object to the number of pixels in an image, and a simple threshold on suitable object size
(ROS > 5%) is used. The segmentation of the left image is then elaborately updated with reference to the
segmentations of the right image and disparity map. The segmentation of the right image is similarly updated.
By utilizing the two updated segmentations, a segment set S is finally constructed after projecting the spatial
locations of the segmentations of the two images into a 3D space in accordance with the location information
of the disparity map. Therefore, each element of the set is composed of the 3D location information of its
segment obtained by utilizing the corresponding segments of the left and right images.

The following step describes how to obtain an optimal segment set S using an objective function with a
criterion.

Step 1—Segmentation

Let $\mathrm{Var}_{\mathrm{inter}}$ be the inter-variance with respect to neighboring segments, and $\mathrm{Var}_{\mathrm{intra}}$ be the intra-variance of the segment itself:

$\mathrm{Var}_{\mathrm{inter}}(S_i, k) = E\Big[\big(\bar{I}_k(S_i) - I_k(u, v)\big)^2\Big], \quad (u, v) \in N(S_i)$

where

$S_i$ is the ith segment in the segment set S
$\bar{I}_k(S_i)$ is the average value of $I_k(S_i)$
$N(S_i)$ is the set of neighboring segments of $S_i$ in the 3D space

The optimal segment set $S^{*}$ is determined so as to maximize the average difference between $\mathrm{Var}_{\mathrm{inter}}$ and $\mathrm{Var}_{\mathrm{intra}}$ with respect to the left (L) and right (R) images and the disparity map (D):

$$S^{*} = \arg\max_{S} \sum_{k \in \{L, R, D\}} \frac{\mathrm{Var}_{\mathrm{inter}}(S_i, k) - \mathrm{Var}_{\mathrm{intra}}(S_i, k)}{3}, \quad \text{s.t. } S_i \in S,\; |S_i| \geq S^{\mathrm{th}} \qquad (17)$$

where

$|S_i|$ is the number of pixels of $S_i$
$S^{\mathrm{th}}$ is a threshold on the minimum segment size

To maximize the objective function in Equation (17), the optimal segment set $S^{*}$ is decided by iteratively finding the affiliated segment of each pixel.
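A sketch of how the objective in Equation (17) might be evaluated for a candidate pixel labeling is given below. The 4-neighbour dilation used to approximate N(S_i) and the per-pixel evaluation over the left image, right image, and disparity map are assumptions made for illustration.

```python
import numpy as np

def inter_intra_gap(labels, channels):
    """Average over k in {L, R, D} of (Var_inter - Var_intra), summed over segments,
    for a given integer label map (candidate segmentation)."""
    total = 0.0
    for img in channels:                                  # channels = (left, right, disparity)
        for s in np.unique(labels):
            mask = labels == s
            mean_s = img[mask].mean()                     # average value over the segment
            # approximate neighbouring pixels by a one-pixel 4-neighbour dilation of the segment
            neigh = np.zeros_like(mask)
            neigh[:-1] |= mask[1:]; neigh[1:] |= mask[:-1]
            neigh[:, :-1] |= mask[:, 1:]; neigh[:, 1:] |= mask[:, :-1]
            neigh &= ~mask
            var_intra = ((img[mask] - mean_s) ** 2).mean()
            var_inter = ((img[neigh] - mean_s) ** 2).mean() if neigh.any() else 0.0
            total += var_inter - var_intra
    return total / len(channels)
```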

5.4.2.3 Classification

For each element, a unique class of 3D perception is determined in the classification step: binocular vision (stereopsis and binocular rivalry) or monocular vision (binocular suppression). Through binocular interaction, the visual cortex determines the binocular disparity that maximizes the cross-correlation between the two images (most similar points). To remain close to this mechanism of the visual cortex, the classification is conducted based on the disparity estimation while utilizing the correlation between the corresponding left and right segments of each element of $S^{*}$. The segments with higher and lower correlations are classified into the stereopsis and binocular rivalry classes, respectively. The threshold on inter-ocular correlation separating stereopsis from binocular rivalry can be obtained from the number of stimulus elements and the element density. Moreover, binocular suppression occurs when the corresponding segments of the left and right images exhibit low similarity and very different stimulus strengths.

The binocular vision elements are then assigned to the stereopsis and binocular rivalry classes ($C_{bs}$ and $C_{br}$), and the monocular vision elements are assigned to the binocular suppression class ($C_m$).

Step 2—Classification

Let $W_b$ and $W_m$ be the class weights:

$$W_b(S_i^{*}) = \begin{cases} 1, & \text{if } C_i^{*} = C_{bs} \text{ and } \mathrm{corr}\big(I_L(S_i^{*}), I_R(S_i^{*})\big) \geq \rho_{sr} \\ 1, & \text{if } C_i^{*} = C_{br} \text{ and } \mathrm{corr}\big(I_L(S_i^{*}), I_R(S_i^{*})\big) < \rho_{sr} \\ 0, & \text{otherwise} \end{cases}$$

$$W_m(S_i^{*}) = \begin{cases} 1, & \text{if } C_i^{*} = C_m,\; \mathrm{corr}\big(I_L(S_i^{*}), I_R(S_i^{*})\big) < \rho_{sr}, \text{ and } \max\big(\psi(I_L(S_i^{*})), \psi(I_R(S_i^{*}))\big) \gg \min\big(\psi(I_L(S_i^{*})), \psi(I_R(S_i^{*}))\big) \\ 0, & \text{otherwise} \end{cases}$$

The class of each element in $S^{*}$ is then decided so as to maximize the sum of $W_b$ and $W_m$:

$C_i^{*} = \arg\max_{C_i} \big[ W_b(S_i^{*}) + W_m(S_i^{*}) \big], \quad \text{s.t. } C_i^{*} \in \mathbf{C},\; \mathbf{C} = \{C_{bs}, C_{br}, C_m\}$    (18)

where

$C_{bs}$ is the class set of stereopsis
$C_{br}$ is the class set of binocular rivalry
$C_m$ is the class set of binocular suppression
$\mathrm{corr}(\cdot)$ is a function of the correlation between the two segments
$\rho_{sr}$ is the correlation threshold between stereopsis and binocular rivalry
$\psi(\cdot)$ is the stimulus strength obtained from the Gabor filter responses, given by the following:

$$\psi\big(I(S_i)\big) = \frac{1}{|S_i|} \sum_{(u, v) \in S_i} \frac{1}{2\pi\sigma_u\sigma_v}\, e^{-\frac{1}{2}\left[\left(\frac{R_1}{\sigma_u}\right)^2 + \left(\frac{R_2}{\sigma_v}\right)^2\right]}\, e^{\,j(u\zeta_u + v\zeta_v)}$$

where

$R_1$ is $u\cos\theta + v\sin\theta$
$R_2$ is $-u\sin\theta + v\cos\theta$
$\sigma_u$ is the standard deviation of the elliptical Gaussian envelope along the x-axis
$\sigma_v$ is the standard deviation of the elliptical Gaussian envelope along the y-axis
$\zeta_u$ is the spatial frequency along the x-axis
$\zeta_v$ is the spatial frequency along the y-axis
$\theta$ is the orientation

As the unit of human perception, a segment shall be defined to satisfy two conditions: 1) it is a locally restricted region, and 2) it has uniform perceptual properties. In the segmentation and classification methods, the segment is determined according to the 3D location of the 3D pixels, and it is then classified according to the similarity between the corresponding stereo segments.

5.4.2.4 Clustering

There exists a relationship between human perception and the size and shape of segments. Variation in a wider segment is more easily adapted to than in a narrower segment. The compactness of a segment is defined as the mean distance between the central point and the other points inside the segment.

C(S_i^*) = \frac{ \mathrm{E}\left[ \sqrt{(u')^2 + (v')^2} \right] }{ \mathrm{E}\left[ \sqrt{(u_i^c - u)^2 + (v_i^c - v)^2} \right] }, \quad (u', v') \in S_i^c,\ (u, v) \in S_i^*     (19)

where

(u_i^c, v_i^c) is the center of S_i^*

Since a circle has the highest possible compactness among objects of a given size, the compactness can be
normalized by the same-sized circle Sic .
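
An informative Python sketch of the compactness measure in Equation (19) for a binary segment mask is given below. The same-sized circle S_i^c is approximated by a discrete disc containing as many pixels as the segment; this discretization is an illustrative assumption.

import numpy as np

def mean_center_distance(mask):
    """Mean Euclidean distance from a segment's centroid to its pixels."""
    vs, us = np.nonzero(mask)
    uc, vc = us.mean(), vs.mean()                # segment center (u_i^c, v_i^c)
    return np.mean(np.sqrt((us - uc) ** 2 + (vs - vc) ** 2))

def compactness(mask):
    """Equation (19): same-sized circle versus the segment itself.

    The same-sized circle S_i^c is approximated by a discrete disc with
    (approximately) the same number of pixels as the segment.
    """
    n = int(mask.sum())
    radius = np.sqrt(n / np.pi)                  # radius of an equal-area disc
    r = int(np.ceil(radius)) + 1
    v, u = np.mgrid[-r:r + 1, -r:r + 1]
    disc = (u ** 2 + v ** 2) <= radius ** 2      # same-sized circle mask
    return mean_center_distance(disc) / mean_center_distance(mask)

# A square segment is nearly as compact as a disc; an elongated one is not
square = np.ones((20, 20), dtype=bool)
rect = np.ones((4, 100), dtype=bool)
print(round(compactness(square), 3), round(compactness(rect), 3))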

The optimal clustered segment set Ŝ* and clustered class set Ĉ* are obtained by using an objective function incorporating the size and compactness of the segment. The clustering based on the perception class is performed by using the following step.

Step 3—Clustering

If S_j^* \in N(S_i^*) with j \ne i and C_i^* = C_j^*, then S_j^* can be clustered with S_i^* relying on the compactness of the cluster, where N(S_i^*) denotes the set of segments neighboring S_i^*. Let \hat{S} be the updated segment set of S^* after clustering, and let \hat{C} be the associated class set of \hat{S}. The optimal sets \hat{S}^* and \hat{C}^* are obtained by

\left(\hat{S}^*, \hat{C}^*\right) = \arg\max_{\hat{S}, \hat{C}} \frac{ \sum_i |\hat{S}_i|\, C(\hat{S}_i) }{ N_S }, \quad \text{s.t. } \hat{S}_i \in \hat{S},\ \hat{C}_i \in \hat{C};\ \text{if } C_i^* \ne C_j^* \text{ or } S_j^* \notin N(S_i^*), \text{ not to be clustered}     (20)

where

|\hat{S}_i|   is the number of pixels in \hat{S}_i
N_S           is the number of segments

To maximize the objective function in Equation (20), the optimal clustered segment set Ŝ* and clustered class set Ĉ* are determined by iteratively adjusting the affiliated segment of each merged segment.


5.4.3 Fine 3D perception

5.4.3.1 3D saliency detection

With 3D perception, humans can move their fixation in 3D space depending on their intention (visual attention). Neurons in the dorsal stream are in charge of the vergence eye movements in this case. Human attention is attracted to visually salient stimuli. Visual salience helps the human brain achieve reasonably efficient selection. The human brain has evolved to rapidly compute salience automatically and in real time over the entire visual field. Thus, visual attention is attracted toward salient visual locations.

A new 3D saliency detection algorithm using spatial and disparity data to estimate visual attention is
introduced by Kim, Lee, and Bovik [B18]. Here this algorithm is employed to detect salient regions in
stereoscopic images.

5.4.3.2 3D coordinate transform

3D visual sensitivity based on binocular foveation and fusion mechanisms can be interpreted via a 3D
coordinate transform. This 3D coordinate is a representation of visual geometry when the non-uniform
visual resolution is mapped onto the uniform domain. Foveation refers to the non-uniform sensitivity of the
human eye, where the resolution decreases as one moves away from the fovea (XY plane). Fusion refers to
the process by which the human brain combines the left and right views to create a fused image. The farther
from the fixation along the Z-axis, the less the brain can fuse the two views. Because these two effects are orthogonal, they can be applied sequentially, and it is therefore reasonable to obtain the perceptual weight (w) in this domain. The local bandwidth of the foveated space corresponds to uniform sampling in curvilinear
coordinates based on foveation.

In addition, to reflect binocular vision afforded by binocular fusion, the foveated image is transformed into
a locally band-limited signal once more. Then the foveated space in curvilinear coordinates also becomes
the space over the curvilinear coordinates by binocular fusion.

However, the monocular vision segment is unaffected by binocular fusion since it is not associated with the
depth resolution. Thus, the binocular suppression segment is excluded from the binocular fusion-based 3D
 
coordinate transform. The final 3D perceptual weight w = \{ w(u,v) \,|\, (u,v) \in \hat{S}_i^* \} of segment \hat{S}_i^* is

w(u,v) = \begin{cases} fus\left( fov_b\left(\hat{S}_i^*\right) \right), & \text{if } \hat{C}_i^* \in \{\hat{C}_{bs}, \hat{C}_{br}\} \\ fov_m\left(\hat{S}_i^*\right), & \text{if } \hat{C}_i^* \in \hat{C}_m \end{cases}, \quad (u,v) \in \hat{S}_i^*     (21)

where

fovm is the weighting function of monocular foveation


fovb is the weighting function of binocular foveation
fus is the weighting function of binocular fusion


5.5 Spatial quality pooling based on human fine 3D perception

5.5.1 General

As an application of human 3D perception, a new spatial pooling strategy is designed in accordance with human 3D perceptual properties and analyses of stereo images. The binocular vision segment is assessed by using both the left and right views, whereas the monocular vision segment is assessed by using only one of the two views. The low-quality prioritized spatial pooling method is used. Moreover, a larger segment with high perceptual sensitivity is weighted more heavily in the pooling.

5.5.2 Quality assessment according to the perception class

In binocular vision segments (Ĉ_bs and Ĉ_br), humans can perceive 3D from the left and right segments. For
accurate assessment of these segments it is important to reproduce the 3D content in accordance with
human perception for the quality assessment (QA). For this purpose, a fused or overlapped segment via
cyclopean presentation is created. Nevertheless, it is difficult to create a realistic cyclopean segment close
to the one perceived by the brain, since it is unknown how the cyclopean image is formed physiologically.
Moreover, it is necessary to account for display geometry, the presumed fixation, vergence,
accommodation, and so on. Toward a limited approximation of this goal, however, an internal segment is
synthesized to achieve a quality level close to that of the true cyclopean segment. A linear model is used to
synthesize the cyclopean segment from stereo segments. The cyclopean image is synthesized from the
Gabor filter responses, which are extracted from the left and disparity-compensated right segments as the relative strength. The distorted cyclopean segment \hat{I}^C(\hat{S}_i^*), which comes from the segment \hat{S}_i^* of the distorted stereo image (\hat{I}^L, \hat{I}^R), is given by

\hat{I}^C\left(\hat{S}_i^*\right) = w^L\left(\hat{S}_i^*\right)\,\hat{I}^L\left(\hat{S}_i^*\right) + w^R\left(\hat{S}_i^*\right)\,\hat{I}^R\left(\hat{S}_i^*\right)     (22)

where

w^L   is the weighting coefficient obtained by the Gabor filter responses for the left segment
w^R   is the weighting coefficient obtained by the Gabor filter responses for the right segment

Because of the law of complementary shares, w^L(u,v) + w^R(u,v) = 1 for all (u,v) \in \hat{S}_i^*. In other words, increasing the Gabor energy for one of the two images suppresses the dominance of the opposite image.

Without loss of generality, it is assumed that the original stereo image (I^L, I^R) exhibits complete stereopsis, and that the disparity map from the original is perfectly matched with human perception. Under this assumption, the real cyclopean image can be better approximated. Therefore, Equation (22) is applied to the original stereo segments I^L(\hat{S}_i^*) and I^R(\hat{S}_i^*) to obtain the original cyclopean segment I^C(\hat{S}_i^*). Since the QA is performed in units of segments, the same segmentation is used for the original image and the distorted image. In other words, the binocular vision segments of the distorted stereo image are assessed by comparing the original and the distorted cyclopean images.

In the binocular suppression segments, only one image with greater stimulus strength than the other is seen.
Thus, the binocular suppression segment is compared with the original segment I^L(\hat{S}_i^*) or I^R(\hat{S}_i^*) depending on the view that is perceived. The quality score of each segment, Q = \{ Q(i) \,|\, \hat{S}_i^* \in \hat{S}^* \}, is then obtained by the following:

Q(i) = \begin{cases} FQ\left( I^C(\hat{S}_i^*),\, \hat{I}^C(\hat{S}_i^*) \right), & \text{if } \hat{C}_i^* \in \{\hat{C}_{bs}, \hat{C}_{br}\} \\ FQ\left( I^L(\hat{S}_i^*),\, \hat{I}^L(\hat{S}_i^*) \right), & \text{if } \hat{C}_i^* \in \hat{C}_m \text{ and } \psi\left(I^L(\hat{S}_i^*)\right) > \psi\left(I^R(\hat{S}_i^*)\right) \\ FQ\left( I^R(\hat{S}_i^*),\, \hat{I}^R(\hat{S}_i^*) \right), & \text{if } \hat{C}_i^* \in \hat{C}_m \text{ and } \psi\left(I^L(\hat{S}_i^*)\right) < \psi\left(I^R(\hat{S}_i^*)\right) \end{cases}     (23)
where

FQ is the full-reference quality metric, such as peak signal-to-noise ratio (PSNR) and SSIM
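
As an informative illustration of Equations (22) and (23) for a single segment, the following Python sketch synthesizes a cyclopean segment from stimulus-strength-based weights and scores it with PSNR as FQ. The scalar stimulus strengths passed in, and the absence of disparity compensation, are illustrative assumptions.

import numpy as np

def psnr(ref, dist, peak=255.0):
    """Peak signal-to-noise ratio, used here as the full-reference metric FQ."""
    mse = np.mean((ref.astype(float) - dist.astype(float)) ** 2)
    return np.inf if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def cyclopean(left, right, strength_left, strength_right):
    """Equation (22): weighted combination of the left and right segments.

    The scalar strengths stand in for the Gabor filter response energies;
    the weights obey the complementary-share constraint wL + wR = 1.
    """
    w_left = strength_left / (strength_left + strength_right)
    return w_left * left + (1.0 - w_left) * right

def segment_quality(orig_l, orig_r, dist_l, dist_r, cls, strength_l, strength_r):
    """Equation (23): per-segment score according to the perception class."""
    if cls in ("stereopsis", "rivalry"):          # binocular vision classes
        ref = cyclopean(orig_l, orig_r, strength_l, strength_r)
        test = cyclopean(dist_l, dist_r, strength_l, strength_r)
        return psnr(ref, test)
    # Binocular suppression: only the view with the greater strength is seen.
    if strength_l > strength_r:
        return psnr(orig_l, dist_l)
    return psnr(orig_r, dist_r)

# Toy example: 16x16 segments, with noise added to the distorted right view
rng = np.random.default_rng(1)
L = rng.uniform(0, 255, (16, 16))
R = L + 2.0
R_dist = R + rng.normal(0, 5, R.shape)
print(round(segment_quality(L, R, L, R_dist, "stereopsis", 1.0, 1.0), 2))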

5.5.3 Adaptive spatial pooling

As shown in Figure 5, humans generally pay more attention to poor regions than good ones when they
conduct subjective QA. Thus, the overall quality is greatly influenced by low-quality regions in Figure 5(a).
Moreover, the size and location of a segment are important factors in determining human sensitivity. The
closer the segment is to the fixation, the higher the sensitivity, as shown in Figure 5(b) in terms of location. In addition, the larger the segment, the higher the sensitivity, as shown in Figure 5(c) in terms of size.

Figure 5 —Humans tend to pay more attention to the segment of low quality in (a),
near to the fixation in (b), and large size in (c)

For the exact quality pooling, first the location and size of each segment are applied to the quality of the
segment Q i  . The 3D transformed area of the segment over the curvilinear coordinate is employed, which
is defined by the perceptual weight w in Equation (21). Since the distance from the fixation is reflected in
the perceptual weight w, the location and size of each segment can be applied at the same time. The 3D
transformed area is given by the following:

wa(i) = \sum_{(u,v) \in \hat{S}_i^*} w(u,v)^2     (24)

Based on wa, the spatially pooled quality \hat{Q} = \{ \hat{Q}(i) \,|\, \hat{S}_i^* \in \hat{S}^* \} is given by the following:

\hat{Q}(i) = \frac{Q(i)}{wa(i)}     (25)

Equation (24) and Equation (25) indicate that a larger segment near the fixation leads to a decrease in the
value of Qˆ (i ) . In other words, the segment with low Qˆ (i ) is highly sensitive to humans, as shown in
Figure 5.


Let z(j) be a function returning the index of the jth lowest element of \hat{Q}. The overall quality is highly affected by the bottom fraction p (p ≤ 1) of the quality region. Thus, the segment set \hat{S}^* is divided into the low-quality set (high sensitivity) P and the complementary set (low sensitivity) P^c as follows:

P = \left\{ \hat{S}^*_{z(k)} \;\middle|\; \frac{\sum_{j=1}^{k} \hat{Q}(z(j))}{\sum_{j=1}^{N_S} \hat{Q}(z(j))} \le p \right\}     (26)

P^c = \hat{S}^* - P     (27)

The final 3D quality score Q can then be obtained by the graded sum of the qualities of P and P^c, which are normalized by the 3D transformed area wa.

Q = \frac{ \sum_{\hat{S}_i^* \in P} wa(i)\, Q(i) + r \sum_{\hat{S}_i^* \in P^c} wa(i)\, Q(i) }{ \sum_{\hat{S}_i^* \in P} wa(i) + r \sum_{\hat{S}_i^* \in P^c} wa(i) }     (28)

where

r (≤ 1)   is the scale factor accounting for the reduced perceptual contribution of the scores in P^c to the overall 3D quality
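
An informative Python sketch of the pooling chain in Equations (24) through (28) is given below; the toy inputs and the values p = 0.3 and r = 0.5 are illustrative assumptions, not values defined by this standard.

import numpy as np

def pool_3d_quality(seg_quality, seg_weight_sq_sum, p=0.3, r=0.5):
    """Low-quality prioritized spatial pooling (Equations (24) through (28)).

    seg_quality       : Q(i), quality score of each segment
    seg_weight_sq_sum : wa(i), sum of squared perceptual weights per segment
    p                 : fraction of the normalized quality mass treated as the
                        low-quality, high-sensitivity set P
    r (<= 1)          : reduced contribution of the complementary set P^c
    """
    q = np.asarray(seg_quality, dtype=float)
    wa = np.asarray(seg_weight_sq_sum, dtype=float)

    q_hat = q / wa                                    # Equation (25)
    order = np.argsort(q_hat)                         # z(j): ascending quality
    cum = np.cumsum(q_hat[order]) / np.sum(q_hat)     # cumulative share
    in_P = np.zeros(len(q), dtype=bool)
    in_P[order[cum <= p]] = True                      # Equation (26); rest is P^c

    num = np.sum(wa[in_P] * q[in_P]) + r * np.sum(wa[~in_P] * q[~in_P])
    den = np.sum(wa[in_P]) + r * np.sum(wa[~in_P])
    return num / den                                  # Equation (28)

# Toy example: a few poor segments pull the pooled score toward them
quality = [20.0, 22.0, 25.0, 38.0, 40.0, 42.0]
weights = [4.0, 3.0, 5.0, 4.0, 2.0, 6.0]
print(round(pool_3d_quality(quality, weights), 2))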

5.6 Objective image-quality assessment of 3D synthesized views

5.6.1 General

The 3D synthesized-view image-quality metric (3DSwIM) is a full-reference objective quality metric dedicated to artifact detection in DIBR-synthesized viewpoints. The metric is based on two main assumptions. First, pixel or region displacements can be introduced by the rendering process without affecting the visual quality of the synthesized images. Second, human perception is more sensitive to artifacts affecting regions containing human beings (e.g., faces or hands), and such artifacts lead to severe subjective quality scores.

5.6.2 3D synthesized-view image-quality metric

3DSwIM is based on the analysis of the statistical features, extracted from the wavelet transform, of the
original and of the DIBR-based synthesized images.

The block diagram of the metric is shown in Figure 6, while a more detailed description of the block-
processing step is in Figure 7.

Figure 6 —Block diagram of 3DSwIM


Figure 7 —Details of the block-processing step

The quality assessment of a frame F, of size n × m pixels, is computed as follows:

 Block partition: given the real and the synthesized views, a frame partition into B non-overlapping
blocks b is performed.
 Registration: a registration procedure is performed for allowing the comparison of matching blocks
between the original and the synthesized views. An exhaustive search-like algorithm is selected and
a search window of size W pixels in the horizontal direction is used. This algorithm calculates the
cost function at each possible location in the search window, and the best matching candidate is
chosen.
 Skin detection: it has been noticed that the presence of human beings in the image under test
increases the annoyance of the detected artifacts. Thus, the skin detection procedure is performed to perceptually weight ( Wskin ) each block containing distorted faces, necks, etc. In more detail, the adopted skin detector performs color segmentation on the H component of the HSV color space, following the method presented in Oliveira and Conci [B31].
 Wavelet transform: each block undergoes a first level Haar wavelet transform. The image
degradation is evaluated by analyzing the statistical variations in the wavelet sub-band related to
the image horizontal details.

 In fact, the filled holes generated through the DIBR process are mainly characterized by high
frequencies in the horizontal direction. If the virtual views are generated to recreate the setup of a
stereo camera, the holes are mainly located close to the vertical edges of the objects, as can be
noticed in Figure 8. These holes correspond to disoccluded areas that are not visible from the reference views, and they are stretched vertically and correspond to horizontal details.
Therefore, the image degradation can be measured by analyzing the statistical variations in the
wavelet sub-band related to the image horizontal details.
 Histogram computation and block distortion computation: the Kolmogorov-Smirnov (KS) distance
between the two histograms is computed to quantify the distance between the distribution function
of the real view FOb ( x) and the distribution function of the synthesized view FSb ( x) . Based on this,
the block distortion can be computed as follows:

d_b = \max_x \left| F_{O_b}(x) - F_{S_b}(x) \right|     (29)


Figure 8 —An example of two views after 3D warping and no hole-filling; black areas correspond to the holes to be filled

 Overall image-quality score: the overall normalized image distortion can be computed as follows:

d = \frac{1}{D_O} \sum_{b=1}^{B} d_b     (30)

where

DO is a normalization constant

The image-quality score is given by the following relation:

s = \frac{1}{ 1 + \dfrac{1}{D_O} \sum_{b=1}^{B} W_{skin} \max_x \left| F_{O_b}(x) - F_{S_b}(x) \right| }     (31)
The score s ranges in the interval [0; 1] where a lower distortion corresponds to a higher score ( d = 0 and
s = 1 ) and a higher distortion corresponds to a lower score ( d → ∞ and s = 0 ).
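
To make the block-level computation concrete, the following informative Python sketch evaluates the block distortion of Equation (29) and the score of Equation (31) using an explicitly written single-level Haar horizontal-detail transform. The histogram bin count, block size, uniform skin weight, and D_O value are illustrative assumptions.

import numpy as np

def haar_horizontal_detail(block):
    """One-level Haar transform: sub-band of horizontal details.

    Averages vertically and differences horizontally, so the coefficients
    respond to vertical structures such as stretched hole-filling artifacts.
    """
    b = block.astype(float)
    b = b[: b.shape[0] // 2 * 2, : b.shape[1] // 2 * 2]
    low_v = (b[0::2, :] + b[1::2, :]) / 2.0
    return (low_v[:, 0::2] - low_v[:, 1::2]) / 2.0

def block_distortion(ref_block, syn_block, bins=32):
    """Equation (29): KS distance between the coefficient distributions."""
    r = haar_horizontal_detail(ref_block).ravel()
    s = haar_horizontal_detail(syn_block).ravel()
    lo, hi = min(r.min(), s.min()), max(r.max(), s.max())
    if lo == hi:
        return 0.0
    hr, _ = np.histogram(r, bins=bins, range=(lo, hi))
    hs, _ = np.histogram(s, bins=bins, range=(lo, hi))
    cdf_r = np.cumsum(hr) / hr.sum()
    cdf_s = np.cumsum(hs) / hs.sum()
    return float(np.max(np.abs(cdf_r - cdf_s)))

def image_score(ref_blocks, syn_blocks, skin_weights, d_o=1.0):
    """Equation (31): skin-weighted block distortions mapped to a [0, 1] score."""
    d = sum(w * block_distortion(r, s)
            for r, s, w in zip(ref_blocks, syn_blocks, skin_weights))
    return 1.0 / (1.0 + d / d_o)

# Toy example: one clean block against a block with a vertical artifact
rng = np.random.default_rng(2)
ref = rng.uniform(0, 255, (64, 64))
syn = ref.copy()
syn[:, 30:34] = 0.0
print(round(image_score([ref], [syn], [1.0]), 3))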

5.7 A 3D subjective quality-prediction model based on depth distortion

5.7.1 General

In this standard, a 3D subjective quality-prediction model is proposed to estimate the 3D quality of synthesized stereo pairs based on depth-map distortion and a neural mechanism, instead of performing view
synthesis directly, which benefits 3D processing. In order to build the model, a dataset is first constructed
(Clause 8) to include distinctive distortion features for depth map coding, and then a subjective test is
conducted. Based on the subjective evaluation results, a prediction model is proposed to estimate the
perceived 3D quality of the synthesized stereo pairs with the features extracted from the texture
characteristics of the decoded depth maps and neural population coding model.

5.7.2 Homogeneity detection model

The system architecture is shown in Figure 9. The input of the distorted data generation is the 3-view color
maps and corresponding depth maps. Sequences with representative content characteristics are selected
from the test set defined in the common test condition of 3D high-efficiency video coding (HEVC) (the
latest 3D video coding project of the Moving Picture Expert Group [MPEG]).


Figure 9 —The proposed system architecture

The homogeneity-detection model divides the depth map region into two types. Pixel homogeneity is
defined as follows:

HOMO_x = \frac{1}{9} \sum_{(i,j) \in O} \left| v_{i,j} - \frac{1}{9} \sum_{(i,j) \in O} v_{i,j} \right|     (32)

where
x is a pixel in the depth map
O is the representation of the region including nine pixels centering on x
v_{i,j} is the intensity value of the pixel (i, j)

The pixels in the depth map are classified into two types based on the pixel homogeneity, texture pixel and homogeneous pixel, as follows:

HOMO_x > HOMO_{Thre} \;\Rightarrow\; x \in \text{texture pixel}     (33)

HOMO_x \le HOMO_{Thre} \;\Rightarrow\; x \in \text{homogeneous pixel}     (34)

where

HOMOThre is a threshold

The coding tree unit (CTU), which is the basic coding unit defined in 3D-HEVC, consists of 64 × 64 pixels.
In a depth map, a CTU will be classified as a texture (Tex) CTU or a homogeneous (Homo) CTU according
to the number of texture pixels. The classification is given by the following:

N > T \;\Rightarrow\; \text{CTU} \in \text{texture CTU}     (35)

N \le T \;\Rightarrow\; \text{CTU} \in \text{homogeneous CTU}     (36)

where

N represents the number of texture pixels in the CTU


T is the threshold
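
The following informative Python sketch implements the pixel homogeneity of Equation (32), assuming a mean-absolute-deviation form, and the classifications of Equations (33) through (36). The threshold values HOMO_Thre = 3 and T = 410 are illustrative assumptions; only the 64 × 64 CTU size is taken from the text above.

import numpy as np

def pixel_homogeneity(depth):
    """Equation (32): mean absolute deviation over each 3x3 neighborhood."""
    d = depth.astype(float)
    pad = np.pad(d, 1, mode="edge")
    # collect the nine shifted copies covering the 3x3 window around each pixel
    stack = np.stack([pad[i:i + d.shape[0], j:j + d.shape[1]]
                      for i in range(3) for j in range(3)])
    local_mean = stack.mean(axis=0)
    return np.abs(stack - local_mean).mean(axis=0)

def classify_ctus(depth, homo_thre=3.0, tex_count_thre=410, ctu=64):
    """Equations (33) through (36): Tex/Homo label for each 64x64 CTU."""
    is_tex_pixel = pixel_homogeneity(depth) > homo_thre       # (33)/(34)
    rows, cols = depth.shape[0] // ctu, depth.shape[1] // ctu
    labels = np.empty((rows, cols), dtype=object)
    for r in range(rows):
        for c in range(cols):
            block = is_tex_pixel[r * ctu:(r + 1) * ctu, c * ctu:(c + 1) * ctu]
            labels[r, c] = "Tex" if block.sum() > tex_count_thre else "Homo"  # (35)/(36)
    return labels

# Toy depth map: a smooth ramp with one noisy (textured) CTU
depth = np.tile(np.linspace(50, 60, 128), (128, 1))
rng = np.random.default_rng(3)
depth[:64, :64] += rng.normal(0, 20, (64, 64))
print(classify_ctus(depth))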


5.7.3 Subjective quality-prediction model

To predict the perceived 3D quality of the synthesized stereo pairs according to the depth map, features from the depth maps, together with the resulting neural responses, are extracted to train a support vector regression (SVR) based model.

Three features are extracted from the texture characteristics of the distorted depth maps as follows:

 Ratio of Tex CTUs: This is the ratio of Tex CTUs in depth maps. It denotes the complexity of the
image scene which can be relevant to the mean opinion score (MOS) level.
 Average distortion of Tex CTUs: This is the average objective distortion of Tex CTUs in depth
maps, which is calculated by sum of squared differences. It denotes the distortion level of the Tex
regions in depth maps.
 Average distortion ratio of Tex CTUs to Homo CTUs (DT/DH): This is the average distortion of
Tex CTUs in the depth map, DT, relative to the average distortion of Homo CTUs in the depth map, DH.

Figure 10 shows an example of MOS variations with DT/DH. It can be observed that, no matter how high
(color map coded at quantization parameter [QP] = 25) or low (color map coded at QP = 40) the bitrate is,
the DT/DH value affects the MOS and the relationship cannot be represented by a linear model.

Figure 10 —Subjective quality of a scene versus DT/DH

Since the perceived 3D quality is a result of a complex interaction among the distortion in texture maps,
distortion in depth maps, and the response of visual neurons, 13 features are extracted from the neural
population coding model. In the neural population coding model, the fine features are defined in terms of
the estimated neural activity levels in the middle temporal (MT) region of the brain, which plays
an important role in encoding horizontal disparity for vergence eye movements. Disparity responses of
13 typical MT neurons are tuned using Gabor functions. The tuning function of the ith typical MT neuron
can be modeled as follows:
R_i(d) = R_0^i + A^i e^{-0.5\,(d - d_0^i)^2 / \sigma_i^2} \cos\left( 2\pi f_i (d - d_0^i) + \Phi_i \right)     (37)

where

d is disparity
R0i is the baseline response
Ai is the amplitude of the Gaussian kernel
d_0^i is the center of the Gaussian


σi is the width of the Gaussian


fi is frequency
Φi is the phase

Based on Ri (d ) , the ith neural feature E[ri ] is generated by the following:

E[r_i] = \sum_{d} P[d]\, R_i(d), \quad 1 \le i \le 13     (38)

where

P[d ] is the probability distribution of disparity d

The actual depth value in the world coordinate z can be calculated by the following:

z = \frac{1}{ \dfrac{v}{255}\left( \dfrac{1}{z_{near}} - \dfrac{1}{z_{far}} \right) + \dfrac{1}{z_{far}} }     (39)

where

v is an 8-bit depth value


znear is the nearest clipping plane
z far is the farthest clipping plane from the origin in world coordinate
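
For concreteness, the depth-value conversion of Equation (39) can be evaluated as in the short Python sketch below; the clipping-plane values used in the example are illustrative assumptions.

def depth_to_world_z(v, z_near, z_far):
    """Equation (39): map an 8-bit depth value to world-coordinate depth z."""
    return 1.0 / ((v / 255.0) * (1.0 / z_near - 1.0 / z_far) + 1.0 / z_far)

# v = 255 maps to the nearest clipping plane, v = 0 to the farthest one
print(depth_to_world_z(255, z_near=2.0, z_far=10.0),
      depth_to_world_z(0, z_near=2.0, z_far=10.0))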

As shown in Figure 11, b is the baseline of the cameras, and z is the actual depth value in the world coordinate.
Points A, B, and C represent the pixels converge before the screen, on the screen, and behind the screen,
respectively. θ 0 , θ1 , and θ 2 are the convergence degree. The actual depth value of position B is equal to
the convergence distance. Thus, the disparity of A and C is defined as the following:

d_A = \theta_0 - \theta_1, \quad d_C = \theta_2 - \theta_1     (40)

where

θ 0 , θ1 , and θ 2 can be calculated by the relationship between b and z

As the disparity is defined in degrees, it is used as the input of Equation (37) to calculate the responses of the neurons.

The model is built using an SVR-based training, which maps feature vectors to the predicted subjective
scores. The dataset includes 308 samples. Eighty percent of the stereo pairs are used for training and 20%
are used for testing. The accuracy of the model is measured by the Spearman rank-order correlation coefficient (SROCC), the Pearson linear correlation coefficient (PLCC), and the root-mean-square error (RMSE) between the objective and subjective scores.

Figure 11 —Disparity defined by viewing degree
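
As an informative illustration of the neural population features, the Python sketch below evaluates the Gabor tuning curve of Equation (37) and the expected response of Equation (38) for a disparity histogram. The tuning parameters of the single example neuron and the disparity distribution are illustrative assumptions; the standard uses 13 tuned MT neurons.

import numpy as np

def mt_tuning(d, r0, a, d0, sigma, f, phi):
    """Equation (37): Gabor-shaped disparity tuning curve of one MT neuron."""
    gauss = np.exp(-0.5 * (d - d0) ** 2 / sigma ** 2)
    return r0 + a * gauss * np.cos(2.0 * np.pi * f * (d - d0) + phi)

def neural_feature(disparities, weights, params):
    """Equation (38): expected response E[r_i] under the disparity distribution."""
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()                       # P[d], probability of each disparity
    responses = mt_tuning(np.asarray(disparities, dtype=float), **params)
    return float(np.sum(p * responses))

# Illustrative tuning parameters for one neuron (disparity in degrees)
params = dict(r0=5.0, a=20.0, d0=0.1, sigma=0.3, f=1.5, phi=0.0)

# Example disparity histogram of a synthesized stereo pair (degrees)
disparities = np.linspace(-0.5, 0.5, 11)
weights = [1, 2, 4, 7, 10, 12, 10, 7, 4, 2, 1]
print(round(neural_feature(disparities, weights, params), 3))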


6. Quality-of-experience assessment for ultra-high-definition contents based on the human visual system

6.1 General

In this clause, the sharpness and preference assessment problems on UHD videos and images will be
considered. In order to do that, different features based on viewing geometry, discrete Fourier transform
(DFT), and natural scene statistics (NSS) will be defined.

6.2 Blind sharpness prediction for ultra-high-definition videos based on the human visual system

6.2.1 General

This subclause describes a new, no-reference sharpness-assessment model for predicting the perceptual
sharpness of UHD videos through analysis of visual resolution variation in terms of viewing geometry and
scene characteristics. The model also accounts for the resolution variation associated with fixation and
foveal regions, which is another important factor affecting the sharpness prediction of UHD video over the
spatial domain, and which is caused by the non-uniform distribution of the photoreceptors.

6.2.2 Spatial resolution over retinal domain

6.2.2.1 Viewing geometry analysis

Figure 12 shows the transition of the viewed pixel size depending on the viewing geometry. The
perceivable pixel size varies according to the viewing distance, the display size, and the display resolution [see Figure 12(a)]. When the size of a perceivable pixel becomes larger and
its spatial resolution decreases, the perceptual sharpness of the image decreases. Figure 12(b) shows a
diagram of the angular resolution N ( N x , lx , z ) as a function of viewing distance, display size lx , and
resolution N x . By using these parameters, the number of viewed pixels per degree can be expressed as

N(N_x, l_x, z) = \frac{N_x}{l_x}\, z \tan\left( \frac{\pi}{180} \right)     (41)
where

Nx is the number of pixels along the horizontal direction


lx is the display size along the horizontal direction
z is the viewing distance between the display and the viewer

Figure 12(c) shows the viewing frequency f_V as a function of display size and viewing distance (f_V = N(N_x, l_x, z)/2). The viewing frequency f_V is lower when the display size is larger or the viewing distance is shorter. In addition, the perceived frequencies at the UHD resolution are higher than those at the HD resolution with the same viewing geometry.


Figure 12 —Examples of changing the pixel size according to (a) display size and
viewing distance, (b) diagram of viewing resolution, (c) viewing frequency (fV)
according to viewing distance, display size, and resolution

6.2.2.2 Viewing resolution analysis based on viewing geometry

Figure 13 shows an example of pixel values over the image, display, and retinal domains. Figure 13(a)
shows the pixel values in the visual angle of 1° over the image domain. N ( N x , lx , z ) in Equation (41)
varies depending on the viewing geometry parameters. For simplicity, the number of pixels will be denoted
as N instead of N ( N x , lx , z ) .

Because the pixels are shown in the display, each value is displayed as a block and not as a point, i.e., as
the light value of a pixel block, as shown in Figure 13(b), not as a digitized value. Thus, let b( x1 , x2 ) be the
pixel value at ( x1 , x2 ) over the image domain for 0 ≤ x1 , x2 ≤ N − 1 , as shown in Figure 13(a). The pixel
value is then converted to b_L = (\beta_1 + \beta_2 b)^{\beta_3} as the value of a pixel block in luminance intensity over the display shown in Figure 13(b), where \beta_1 = 0.7656, \beta_2 = 0.0364, and \beta_3 = 2.2 under the assumption of the Adobe red, green, blue (RGB) display condition. In the display domain, the number of pixels N is mapped one-to-one onto the number of pixel blocks N. The size of each block is l_x / N_x as a function of the display size l_x and resolution N_x, as shown in Figure 12(b) and Figure 13(b). Figure 13(b) shows the luminance intensities over the display domain after the conversion from the pixel values.


Figure 13 —Example representation of the pixel values over (a) image domain,
(b) display domain, and (c) retinal domain

The light emitted from the display first passes through the optics of the eye, and then is sampled by
photoreceptors on the retina. Thus, the resolution ability of the human eye is determined by the density of
its photoreceptors. Figure 13(c) shows the values sampled by the photoreceptors through N̂ samplers over
the retinal domain. Figure 13(c) shows an example in which the light of each pixel block is upsampled in
the ratio of 1:3 in accordance with the resolution of the human eye. However, depending on the viewing
geometry, it could be downsampled. When the size of the pixel block is large due to a large display size, it
can be captured by more samplers on the retina. In general, the light of each pixel block is sampled in the
ratio of N̂/N on the retina.

To analyze the pixel values over the image domain in terms of visual resolution over the retinal domain, it
is necessary to express the metric in terms of the resolution by N̂ for better perceptual sharpness and
quality evaluation. The nearest neighbor upsampling is adopted to derive the luminance bR ( x1 , x2 ) on the
retina from bL . x̂1 and x̂2 are the horizontal and vertical sample coordinates on the retina, respectively
( 0 ≤ xˆ1 , xˆ2 ≤ Nˆ ). Finally, the DFT coefficient bR (uˆ1 , uˆ2 ) over the retinal domain is derived from bR .


Figure 14 —Examples of the perceived images according to display size and resolution and their DFT spectra

Figure 14 shows an example of perceived images and their DFT spectra in accordance with the display size
and resolution. The top images represent the perceived images from displays with HD resolution and
different display sizes. The left display is smaller than the right display. For a given block size, two blocks
are indicated by HD-L and HD-R. Similarly, the same images with UHD resolution are displayed at the
bottom for two different display sizes. Two blocks with the same block size are indicated by UHD-L and
UHD-R. The spectra of UHD-L and UHD-R are distributed up to higher frequencies than the spectra of
HD-L and HD-R. When the display size becomes larger, the viewing resolution decreases and the DFT
spectrum narrows. Therefore, by using the DFT spectrum over the retinal domain, it is possible to analyze
the perceived image projected onto the retina.

6.2.3 Scene adaptive sharpness measurement

6.2.3.1 Mode classification

To evaluate the sharpness of an image adaptively, the mode of each pixel over a local window is classified
according to the types of blur and texture. Each local region of 128 × 128 is sampled from the image.
Figure 15 shows the mode classification for adaptive local sharpness measurement.

The existence of motion blur in each region is determined by measuring the motion velocity. For the nth local
region, the motion velocity V (n) is obtained using an optical flow algorithm. If V (n) > Vth , the nth local
region is thought to exhibit motion blur. Vth is the threshold needed to classify the region as a motion-blur
region. Here, it is set to Vth = 4 (pixels/frame).


Figure 15 —Mode classification for adaptive local sharpness measurement

Next, the texture or edge classification proceeds sequentially. The DFT spectrum of each region is divided
into four parts: direct current (DC), low frequency (LF), medium frequency (MF), and high frequency (HF).
Then, the texture energy (TexE) for each block is approximated by TexE = MF + HF. Based on TexE as the sum of the magnitudes of DFT coefficients for each part, the local regions are classified further as texture and edge regions, as follows:

\begin{cases} \text{Edge}, & \text{if } TexE \le \mu_1 \\ \text{Texture}, & \text{otherwise} \end{cases}     (42)

where

µ1 is a constant

After this step, each region is classified as one of the four modes shown in Figure 15.
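
The mode decision described above can be sketched in Python as follows, given a local region and its optical-flow velocity. The radial band boundaries used to approximate the LF, MF, and HF parts of the spectrum and the texture threshold mu_1 are illustrative assumptions; only V_th = 4 pixels/frame is stated above.

import numpy as np

def texture_energy(region, lf_frac=0.15, mf_frac=0.40):
    """Approximate TexE = MF + HF from the region's DFT magnitude spectrum."""
    spec = np.abs(np.fft.fftshift(np.fft.fft2(region.astype(float))))
    h, w = spec.shape
    v, u = np.mgrid[0:h, 0:w]
    radius = np.hypot(v - h / 2.0, u - w / 2.0) / (min(h, w) / 2.0)
    mf = spec[(radius > lf_frac) & (radius <= mf_frac)].sum()
    hf = spec[radius > mf_frac].sum()
    return mf + hf

def classify_mode(region, velocity, v_th=4.0, mu_1=1.0e7):
    """Four-mode classification: motion/no-motion x texture/edge."""
    motion = "motion" if velocity > v_th else "no motion"           # motion-blur test
    kind = "edge" if texture_energy(region) <= mu_1 else "texture"  # Equation (42)
    return f"{motion}/{kind}"

# Toy 128x128 regions: random texture versus a single step edge
rng = np.random.default_rng(4)
textured = rng.uniform(0, 255, (128, 128))
edge = np.zeros((128, 128))
edge[:, 64:] = 255.0
print(classify_mode(textured, velocity=1.0), classify_mode(edge, velocity=6.0))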

6.2.3.2 Adaptive sharpness evaluation

The directional features are defined to capture changes in the orientation structure of the spectrum when an
image is affected by defocus or motion blur. Directional sharpness is captured using the variance and
energy in the spectral domain along a specific direction, and it tends to decrease in the motion direction.
The directional energy DE (φm ) and directional variance DV (φm ) of the DFT coefficients are as follows:

DE(\phi_m) = \left[ \sum_{u=-S/2}^{S/2} \frac{ \left| C_{\phi_m}(u) \right|^2 }{ B(0,0) } \right]^{\frac{1}{2}}     (43)

DV(\phi_m) = \sum_{u=-S/2}^{S/2} p_{\phi_m}(u) \left[ f_{\phi_m}(u) - \mu_{\phi_m} \right]^2     (44)


where

C_{\phi_m}(u) = \left\{ B(u,\, u\tan(\phi_m)) \mid u \in (-S/2,\, S/2) \right\}

p_{\phi_m}(u) = \frac{ \left| C_{\phi_m}(u) \right|^2 }{ \sum_{u=-S/2}^{S/2} \left| C_{\phi_m}(u) \right|^2 }

B is the DFT of the image


S is the DFT block size

C_{\phi_m}(u) contains the DFT coefficients of the image along the direction \phi_m, where \phi_m \in \{\phi_1, \phi_2, ..., \phi_{N_{dr}}\} is the mth direction and N_{dr} is the number of directions. In addition, p_{\phi_m}(u) is the probability mass function of the normalized power spectrum along each direction \phi_m; \mu_{\phi_m} is the first central moment of the power spectrum along each direction \phi_m; and f_{\phi_m}(u) is the spatial frequency. Linear interpolation is used to
calculate the u tan(φm ) coefficients that do not fall on the discrete grid. The sharpness of an image is
affected by directional variance (DV) and directional energy (DE). Thus, the directional sharpness DS (φm )
of the image along the direction φm is expressed as the product of DE (φm ) and DV (φm ) :

DS(\phi_m) = DE(\phi_m) \times DV(\phi_m)     (45)

The high-frequency coefficients of the DFT spectra are attenuated when defocus or motion blur occurs. To quantify the attenuation of the DFT spectra, \overline{DS} and \widetilde{DS} are defined, which respectively indicate the amount of directional sharpness and its variation. When defocus blur occurs, the overall DFT spectrum is attenuated regardless of direction. Thus, the amount of directional sharpness (\overline{DS}) decreases according to defocus blur. Moreover, the DFT spectrum is attenuated by motion blur along a specific direction, and the variation in directional sharpness (\widetilde{DS}) increases. Using these characteristics, the local sharpness score is defined utilizing these two factors as the following:

s = \overline{DS}^{\,\eta} \times \left( \frac{c}{\widetilde{DS} + c} \right)^{1-\eta}     (46)

where

η is the weight parameter


c is a variable used to stabilize the score

Table 1 shows the equations for the calculations of the amount and variation of directional sharpness,
depending on the mode.


Table 1—Local sharpness measurement for each mode

Local sharpness score: s = \overline{DS}^{\,\eta} \times \left( \frac{c}{\widetilde{DS} + c} \right)^{1-\eta}

Mode 1 (motion/texture):     \overline{DS} = DS(\psi_{motion}),   \widetilde{DS} = \sigma_{s_\phi} / \mu_{s_\phi}
Mode 2 (motion/edge):        \overline{DS} = DS(\psi_{motion}),   \widetilde{DS} = 0
Mode 3 (no motion/texture):  \overline{DS} = \frac{1}{N_{dr}} \sum_{m=1}^{N_{dr}} DS(\phi_m),   \widetilde{DS} = \sigma_{s_\phi} / \mu_{s_\phi}
Mode 4 (no motion/edge):     \overline{DS} = DS(\psi_{edge}),     \widetilde{DS} = 0

6.3 Visual preference assessment on ultra-high-definition (UHD) images

6.3.1 General

In this subclause, a visual preference assessment on UHD images is introduced. The model is designed to
predict how much a sharpness or contrast level is preferred by the users. It is constructed by training an
SVR based on a set of features derived from viewing geometry, visual contents, NSS of images, and the
corresponding MOS.

6.3.2 Features of visual preference

6.3.2.1 General

This standard considers three types of features: viewing geometry features, content features, and natural
scene statistics (NSS) features. First, several feature maps are defined. Then, global and local feature
extraction approaches are presented to derive the features from these maps.

6.3.2.2 Viewing geometry

The first feature in the viewing geometry features is the maximum viewing frequency fV , which is
discussed in 6.2.2.1. Next, the DFT is performed on the images to obtain the spectrum, and then a low-pass filter (LPF) with a cutoff frequency of f_V is used to obtain the spatial frequency I_LPF. To apply the visual resolution to the proposed model, the spatial frequency in terms of DFT spectra is used.

6.3.2.3 Content feature maps

The content features include sharpness features, contrast features, and saturation features. These features
are derived from color and luminance gradients, color and luminance contrasts, and color and luminance
saturations (Kim, Ahn, Kim, and Lee [B17]). The color and luminance gradients are defined as follows:

G_l = \frac{1}{G_l^M} \sqrt{ \left( \frac{\partial I}{\partial u} \right)^2 + \left( \frac{\partial I}{\partial v} \right)^2 }     (47)


G_c = \frac{1}{2 G_{c,a}^M} \sqrt{ \left( \frac{\partial C_a}{\partial u} \right)^2 + \left( \frac{\partial C_a}{\partial v} \right)^2 } + \frac{1}{2 G_{c,b}^M} \sqrt{ \left( \frac{\partial C_b}{\partial u} \right)^2 + \left( \frac{\partial C_b}{\partial v} \right)^2 }     (48)

where

Gl is the normalized luminance gradient


Gc is the normalized color gradient
u          is the index along the x-axis
v          is the index along the y-axis
I          is the luminance map
G_l^M      is the maximum luminance gradient of I as a normalization factor
C_a        is a color channel of I in the perceptually uniform CIELab color space
C_b        is a color channel of I in the perceptually uniform CIELab color space
G_{c,a}^M  is the maximum color gradient of C_a for normalization
G_{c,b}^M  is the maximum color gradient of C_b for normalization

The contrast maps are calculated as follows:

C_l = \frac{1}{C_l^M} \frac{1}{N-1} \sum_{j=1}^{N} \left( \frac{I_j - \bar{I}}{\bar{I}} \right)^2     (49)

C_c = \sum_{k \in \{a,b\}} \frac{1}{2 C_{c,k}^M} \frac{1}{N-1} \sum_{j=1}^{N} \left( \frac{C_{k,j} - \bar{C}_k}{\bar{C}_k} \right)^2     (50)

where

C_l        is the normalized luminance contrast
C_c        is the normalized color contrast
N          is the patch size
j          is the pixel index within the patch
\bar{I}    is the mean luminance of the patch
C_l^M      is the maximum luminance contrast for normalization
C_{c,k}^M  is the maximum color contrast for normalization
\bar{C}_k  is the mean of the patch colors

Finally, the saturation maps are as follows:

S_l = \frac{127 - I(u,v)}{127}     (51)

S_{c_a} = \frac{127 - C_a(u,v)}{127}     (52)

S_{c_b} = \frac{127 - C_b(u,v)}{127}     (53)

S_c = S_{c_a} S_{c_b}     (54)


where

Sl is the normalized luminance saturation (Kim, Ahn, Kim, and Lee [B17])
Sca is the normalized color saturation on Ca
Scb is the normalized color saturation on Cb
Sc is the normalized color saturation
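
A minimal informative Python sketch of the luminance feature maps corresponding to Equations (47), (49), and (51) is given below; the color-channel maps follow the same pattern on C_a and C_b. The patch size, the forward-difference derivatives, and the simplified per-patch contrast are illustrative assumptions.

import numpy as np

def luminance_gradient_map(I):
    """Equation (47): gradient magnitude normalized by its maximum."""
    du = np.diff(I, axis=1, append=I[:, -1:])     # dI/du (x direction)
    dv = np.diff(I, axis=0, append=I[-1:, :])     # dI/dv (y direction)
    g = np.sqrt(du ** 2 + dv ** 2)
    return g / (g.max() + 1e-12)

def luminance_contrast_map(I, patch=8):
    """Per-patch mean squared relative deviation, a stand-in for Equation (49)."""
    h, w = (I.shape[0] // patch) * patch, (I.shape[1] // patch) * patch
    blocks = I[:h, :w].reshape(h // patch, patch, w // patch, patch)
    mean = blocks.mean(axis=(1, 3), keepdims=True)
    contrast = (((blocks - mean) / (mean + 1e-12)) ** 2).mean(axis=(1, 3))
    return contrast / (contrast.max() + 1e-12)

def luminance_saturation_map(I):
    """Equation (51): S_l = (127 - I(u, v)) / 127."""
    return (127.0 - I) / 127.0

rng = np.random.default_rng(6)
I = rng.uniform(0, 255, (64, 64))
print(luminance_gradient_map(I).shape,
      luminance_contrast_map(I).shape,
      round(float(luminance_saturation_map(I).mean()), 3))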

6.3.2.4 Natural scene statistics feature maps

The overall procedure is depicted in Figure 16. First, each feature map above ( Gc , Gl , Cc , Cl , Sc , and Sl )
is transformed by using wavelet decomposition at two scales. Then, for each band of the wavelet
coefficients, a histogram is constructed. Next, a generalized Gaussian distribution is fitted to the histogram.
From the shape parameters of the distributions, the visual activity is defined as follows:

A_{k,i} = \begin{cases} 1 - \dfrac{1}{1 + \exp\left( -2 \left( \dfrac{\gamma_{k,i} - c_k}{c_{k,D}} \right) \right)}, & \text{if } \gamma_{k,i} < c_k \\ 1 - \dfrac{1}{1 + \exp\left( -2 \left( \dfrac{\gamma_{k,i} - c_k}{c_{k,U}} \right) \right)}, & \text{otherwise} \end{cases}     (55)

where

A_{k,i}      is the visual activity of the ith wavelet sub-band of the kth feature
\gamma_{k,i} is the shape parameter of the generalized Gaussian distribution fitted to the ith wavelet sub-band of the kth feature map
c_k          is the reference operating point of the kth feature map
c_{k,D}      is the trailing edge of the region over which A_{k,i} is approximately linear
c_{k,U}      is the leading edge of the region over which A_{k,i} is approximately linear

Figure 16 —Procedure for measuring the visual activity of each band through natural scene statistics


The visual activity of each feature map is A_k = \sum_i A_{k,i} / N_{sub}, where N_{sub} is the number of wavelet sub-bands. The overall visual activity as the NSS feature is calculated as follows:

A = \sum_k w_k A_k     (56)

where

wk is the weight of each feature for the weighted sum introduced in Lee, Moorthy, Lee, and Bovik
[B21].
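
The following informative Python sketch illustrates the visual-activity mapping of Equation (55), estimating the generalized Gaussian shape parameter by simple moment matching instead of a full distribution fit. The operating point c_k and the edge parameters c_{k,D} and c_{k,U} are illustrative assumptions.

import numpy as np
from math import gamma as G

def ggd_shape(coeffs):
    """Estimate the generalized Gaussian shape parameter gamma_{k,i}.

    Uses the moment ratio E[|x|]^2 / E[x^2] and a grid search, a common
    simple alternative to maximum-likelihood fitting.
    """
    x = np.asarray(coeffs, dtype=float)
    r_hat = np.mean(np.abs(x)) ** 2 / np.mean(x ** 2)
    grid = np.arange(0.2, 6.0, 0.001)
    r_theory = np.array([G(2.0 / g) ** 2 / (G(1.0 / g) * G(3.0 / g)) for g in grid])
    return float(grid[np.argmin(np.abs(r_theory - r_hat))])

def visual_activity(gamma_ki, c_k=1.0, c_kd=0.2, c_ku=0.4):
    """Equation (55): logistic mapping of the shape parameter to activity."""
    scale = c_kd if gamma_ki < c_k else c_ku
    return 1.0 - 1.0 / (1.0 + np.exp(-2.0 * (gamma_ki - c_k) / scale))

# Laplacian-like coefficients (gamma near 1) yield higher activity than
# Gaussian-like coefficients (gamma near 2)
rng = np.random.default_rng(7)
lap = rng.laplace(0, 1, 50_000)
gau = rng.normal(0, 1, 50_000)
g_lap, g_gau = ggd_shape(lap), ggd_shape(gau)
print(round(g_lap, 2), round(visual_activity(g_lap), 3))
print(round(g_gau, 2), round(visual_activity(g_gau), 3))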

6.3.2.5 Global and local feature extractions

Global and local features are extracted from the feature maps introduced above ( I LPF , Gc , Gl , Cc , Cl , Sc ,
and Sl ). Figure 17 depicts the concept of global and local feature calculation. For the former, all values in
each feature map are averaged. For the latter, each feature map is partitioned into blocks and the mean
values of the upper and lower p % blocks are calculated. In total, there are 29 features, which are
summarized in Table 2.

Figure 17 —Description of global and local feature calculation

Table 2—Description of the visual preference features

Category                   Feature description                               Feature type
Viewing geometry features  Maximum viewing frequency                         Global
                           Spatial frequency
Content features           Normalized luminance gradient                     Global,
                           Normalized color gradient                         Local 1,
                           Normalized luminance contrast                     Local 2
                           Normalized color contrast
                           Normalized luminance saturation
                           Normalized color saturation
NSS features               Visual activity of luminance/color gradient       Global
                           Visual activity of luminance/color contrast
                           Visual activity of luminance/color saturation
                           Overall visual activity

6.3.3 Visual preference assessment model

The visual preference assessment model is constructed by training an SVR on the set of the visual
preference features and subjective test MOS. The computed PLCC (~0.8) and SROCC (~0.8) indicate that
the assessment model performs well as a predictor of human visual preference.


7. The impact of compression and packet losses on 3D perception

7.1 General

In this clause, a 3D video database for the evaluation of visual quality assessment metrics is described. The
effects of packet losses on the overall 3D perception (i.e., distortions due to different packet loss rates [PLRs])
are considered in this research. The database presented here contains 54 test 3D video sequences from nine
reference test videos and six different PLRs, including the original 3D image sequence. In order to obtain true
3D video perception, about 1730 individual human quality observations are considered for this database. The
obtained differential mean opinion score (DMOS) can be effectively used for evaluating 3D video quality
metrics, as well as for designing new 3D video quality evaluation methods. Together with the DMOS values, the database provides the corresponding objective quality measurements obtained using several objective quality metrics. The
designed 3D video database is freely available for download and use in scientific research.

7.2 Description of the proposed 3D video database

7.2.1 Source sequences

Nine uncompressed 3D video sequences from different sources are considered for this database. The details
of these sequences are listed in Table 3. These sequences cover a wide range of texture and motion
complexity, depth levels, and frame rates (e.g., 15 fps and 30 fps) and image resolutions (e.g., HD
resolutions and mobile resolutions, such as 480 pixels × 270 pixels). Ten-second-long sequences are
considered for the subjective tests. When there were not enough frames to compose 10-second sequences,
frames were repeated backward to create a smooth sequence.

Table 3—Description of the selected sequences

No.  Sequence                    Resolution   Characteristics                                                           Frame rate  Production
1    Café                        1920 × 1080  Natural, indoor sequence, medium motion, medium complexity depth          30          GIST
2    Beergarden                  1920 × 1080  Natural, indoor sequence, slow motion, medium complexity depth            30          3D4YOU project
3    Newspaper (cameras 4 & 6)   1024 × 768   Natural, indoor sequence, medium motion                                   30          GIST
4    Lovebird 1 (cameras 6 & 8)  1024 × 768   Natural, outdoor sequence, slow motion                                    30          ETRI
5    Kendo (cameras 3 & 4)       1024 × 768   Natural, indoor, panning camera, medium motion, complex depth structure   25          Nagoya University
6    Ballroom (cameras 3 & 4)    640 × 480    Natural, indoor, very fast motion, complex depth structure                15          Mitsubishi Electric Research Labs, USA
7    Mobile                      720 × 540    Synthetic sequence, fast motion, low complexity depth                     15          —
8    Horse                       480 × 270    Outdoor, high detail, complex depth structure, natural light              15          KUK film production
9    Car                         480 × 270    Outdoor, complex object motion, camera motion, depth structure            15          KUK film production


7.2.2 Test sequences

Five test sequences from each of the reference 3D sequences were created by using simulated PLRs of 0%,
1%, 3%, 5%, and 10%. PLRs above 10% were not considered since, with recent error-protection mechanisms, it is highly unlikely to lose more than 10% of packets even in the most adverse channel conditions.
Initially, the original 3D video sequences are compressed using the H.264/AVC encoding standard (JM
reference software version 12.0). In order to maintain the right balance between good image quality and
available bandwidth, it is necessary to select an appropriate QP for encoding. After several encoding tests
with the selected sequences for the 3D database, QP = 30 was selected for encoding the original 3D video
sequences. IPPP...IPPP... sequence format is selected for encoding. In order to obtain a more robust
bitstream, slices were introduced in the encoded bit-stream. One macroblock row of each image frame is
considered as a separate slice. These are encoded into real-time transport protocol (RTP) bit-stream format
to simulate random RTP packet losses. The packet loss simulator from the JM reference software suite is
employed to drop packets randomly at 0%, 1%, 3%, 5%, and 10% loss rates. In order to conceal the lost packets, the “slice copy” feature of the JM reference software decoder is utilized. Hundreds of simulation runs are performed for each test video, using different starting positions of the error patterns to obtain average results for a
given PLR. Objective results (e.g., PSNR of left and right views) are calculated for each of these simulation
runs. Then results are averaged over all the simulation runs to obtain the average quality results for each
sequence. The sequence with the PSNR closest to the average PSNR obtained over all the simulation runs
is selected for the subjective quality evaluation tests.

7.3 Subjective experiments

The subjective quality evaluation procedure to analyze the effect of packet losses on the perceived quality
of stereoscopic video is described in this subclause. The resultant quality of the synthesized binocular video
is rated by subjects for the perceived overall 3D image quality.

7.3.1 Method

7.3.1.1 Design

The experiment has a mixed design with test sequences (nine left and right 3D video sequences) and packet
losses (five PLRs) as within subject factors, and one evaluation concept (i.e., overall 3D image quality)
tested among subjects. Both image quality and depth perception are considered as major factors
contributing to the overall 3D image quality, and this is explained to all subjects during the training
sessions conducted before the test.

7.3.1.2 Observers

Sixteen non-expert observers (five female observers and eleven male observers) volunteered to participate in
the experiment. The observers are mostly research students and staff with a technical background. Their age
ranges from 20 to 45. Ten observers had prior experience with different formats of 3D video content, such as
3D movies in IMAX theaters. All participants have a visual acuity of 1 (as tested with the Snellen chart), good
stereo vision ≤ 60 seconds of arc (as tested with the Butterfly stereo test), and good color vision (as tested with
the Ishihara test). The perceptual attribute, namely the perceived overall 3D image quality, is assessed by all
the observers in two different sessions, about 20 mins each. During the initial test, five 3D sequences were
assessed, and the rest of the sequences were assessed in the second phase of the experiment in a different time
instance. This avoids any fatigue caused by 3D viewing for a long period of time. In order to understand the
comfort level, subjects are asked to complete a symptom list (i.e., four-level category scale: none, slight,
moderate, severe) before and after each subjective test session (Figure 18). The results of the test do not show
any significant increase in symptom levels during the course of the tests.


Figure 18 —Symptoms indicator

7.3.1.3 Stimuli

Nine left and right stereoscopic video sequences, namely Lovebird 1, Newspaper, Kendo, Cafe, Beergarden,
Ballroom, Mobile, Car, and Horse, are used in this experiment. The stimulus set contains five scene
versions which are decoded after corrupting the H.264/AVC bit-streams using five different error patterns
(i.e., 0%, 1%, 3%, 5%, and 10% PLR). The original, uncompressed version of each scene is used as the
reference in the evaluation test. All combinations of corrupted 3D video sequences are presented twice to
the viewer. Therefore, in total nine test sequences, one repetition and six PLR values (including the
reference) are used. This resulted in a stimulus set of 9 × 2 × 6 = 108 3D video sequences.

7.3.1.4 Equipment

A 47-inch LG polarized stereoscopic display is used in the experiment to display the stereoscopic video
material. The optics of this display are optimized for a viewing distance of 2 m. Hence, the viewing
distance for the observers is set to 2 m. This viewing distance is in compliance with the preferred viewing
distance (PVD) of the ITU-R BT.500-12 Recommendation, which specifies the methodology for subjective
assessment of the quality of television pictures. The 3D display is calibrated using a GretagMacbeth Eye-
One Display 2 calibration device. The peak luminance of the display is 200 cd/m2. The measured
environmental illumination is 190 lux. The background luminance of the wall behind the monitor is 20 lux.
The ratio of luminance of inactive screen to peak luminance is less than 0.02. These environmental
luminance measures remained the same for all test sessions, as the lighting conditions of the test room are
kept constant. The original and distorted 3D image sequences are initially combined into top-bottom video
sequences. These top-bottom image sequences are then input to the 3D display to allow subjects to evaluate
the overall image quality and depth perception. A proprietary software designed by Kingston University-
London is utilized to display the sequences on the display automatically. Subjects mark the test sequences
in a five-level categorical scale (i.e., bad, poor, fair, good, and excellent) according to the double stimulus
continuous quality scale (DSCQS) method.

7.3.2 Procedure

A set of 108 stereoscopic video sequences is randomized and presented sequentially. A subjective test
experiment lasts for approximately 20 minutes. Before starting the test, observers are asked to read the
instructions explaining the test procedure and the attributes they had to rate. Thereafter, a trial session is
conducted before the actual test for the viewers to get acquainted with the 3D display and the range of
different PLRs associated with the stimulus set. The training is divided into two stages. The first stage used
commercially available 3D video footage for the viewers to get acquainted with the 3D display and 3D
viewing in general. The second stage of training provides the sequences from the test which cover a range
of distortion levels.


7.3.3 Results and discussion

All subjects’ results are analyzed according to Annex 2 (analysis and presentation of results) of the ITU-R BT.500-12 Recommendation. The mean opinion score and the relevant 95% confidence interval are
calculated for each presentation. Observer screening is carried out after analyzing the reliability of
individual opinion scores. No outliers were found for all the tests considered.

Figure 19 shows the subjective results (DMOS) distribution for all the test sequences considered in the test.
This shows the uniform distribution of the subject scores over all the considered quality levels for the
sequences selected for the database.

Figure 19 —DMOS for all 3D test sequences

Table 4 shows the mean DMOS, the standard deviation, and the 95% confidence interval for the selected
3D test video sequences and for all the sequences on average. Figure 20 shows the calculated DMOS scores
for the perceived overall 3D image quality of the “Lovebird 1” sequence with different PLRs. The 95%
confidence interval (CI) for each DMOS score is also presented. The x-axis and y-axis respectively
represent the different PLRs applied to the H.264/AVC coded 3D video sequences and DMOS scores from
0 (excellent) to 5 (bad).

Table 4—Statistics related to subjective test scores


Sequence   Newspaper                        Cafe
PLR        0%    1%    3%    5%    10%      0%    1%    3%    5%    10%
Mean       0.27  0.42  0.97  1.23  1.94     0.06  0.84  1.94  2.38  2.81
STD        0.50  0.27  0.71  0.74  0.88     0.47  1.15  0.70  0.68  0.84
95% CI     0.24  0.13  0.35  0.36  0.43     0.23  0.56  0.34  0.33  0.41

Sequence   Car                              All the sequences
PLR        0%    1%    3%    5%    10%      0%    1%    3%    5%    10%
Mean       0.69  1.14  1.58  2.09  2.64     0.20  1.01  1.57  1.91  2.49
STD        0.43  0.63  0.60  0.65  0.82     0.25  0.36  0.38  0.39  0.32
95% CI     0.21  0.31  0.29  0.32  0.40     0.16  0.24  0.25  0.25  0.21


Figure 20 —PLR versus DMOS for “Lovebird 1” sequence

From Figure 20 it can be observed that, as expected, the perceived overall 3D image quality degrades with
increasing PLR for the “Lovebird 1” sequence.

Figure 21 shows the DMOS scores for the perceived overall 3D image quality with different PLRs for all
the sequences considered. The 95% confidence interval (CI) for each DMOS score is also presented.

Figure 21 —PLR versus DMOS for all the sequences

8. A database of stereoscopic images plus depth

8.1 Description of the proposed stereoscopic image database plus depth

8.1.1 Source sequences

Fourteen scenes with representative content characteristics are selected for building the dataset. Figure 22
shows examples of the scenes. Table 5 lists the names and resolutions of the fourteen scenes. Eight of the
scenes are indoor scenes (Newspaper, Book_Arrival, Kendo, Balloons, Champagne_tower, Pantomime,
Poznan_Hall1, Poznan_Hall2) and six are outdoor scenes (Lovebird1, GT_Fly, Shark, Microworld,
Poznan_CarPark, Poznan_Street); three are computer-graphics scenes (GT_Fly, Shark, Microworld) and
eleven are natural scenes (Newspaper, Book_Arrival, Kendo, Balloons, Lovebird1, Champagne_tower,
Pantomime, Poznan_Hall1, Poznan_Hall2, Poznan_CarPark, Poznan_Street). Depth maps are available for
all 14 scenes.

Figure 22 —Example images of the 14 scenes in the dataset

Table 5—Size and name of the 14 scenes


Resolution     Scene name
1024 × 768     Newspaper
               Book_Arrival
               Kendo
               Balloons
               Lovebird1
1280 × 960     Champagne_tower
               Pantomime
1920 × 1088    GT_Fly
               Shark
               Microworld
               Poznan_Hall1
               Poznan_Hall2
               Poznan_CarPark
               Poznan_Street


8.1.2 Test sequences

The stereoscopic image dataset consists of 308 stereo pairs. The inputs are the color maps of three views and
the corresponding depth maps.

The homogeneity-detection model in 5.7 is used to divide the depth maps into Tex (textured) and Homo
(homogeneous) regions. The minimum processing unit applied during detection is a coding tree unit (CTU) of
64 × 64 pixels, which is compatible with the basic processing unit defined in 3D-HEVC. A pixel with a high
gradient is regarded as a Tex pixel. If the number of Tex pixels in a CTU exceeds a threshold, the CTU is
classified as a Tex CTU; otherwise it is classified as a Homo CTU. Based on the detection results, different
levels of distortion are introduced into different regions of the depth maps by encoding the color maps and
depth maps with a modified version of the 3D-HEVC reference software, 3D-HTM 6.1. Different distortion
ratios between the Tex CTUs and the Homo CTUs are generated for the depth maps by compressing the Tex
CTUs with the QPs defined in the common test conditions (34 and 45, corresponding to high bitrate and low
bitrate, respectively), while varying the QP of the Homo CTUs within [Tex QP − 10, Tex QP + 10] in steps
of 2. The color maps are compressed with QPs 25 and 40, corresponding to high bitrate and low bitrate. The
detailed QP values are shown in Table 6. Defining the average distortion of the Tex CTUs in a depth map as
DT and the average distortion of the Homo CTUs as DH, 11 texture-and-depth pairs corresponding to 11
DT/DH ratios can be generated for each scene. After decoding and view synthesis with the View Synthesis
Reference Software, the two intermediate views suggested by the common test conditions are selected as the
stereo pair; Table 7 lists the view numbers for each stereo pair. Finally, 14 × 2 × 11 = 308 stereo pairs form
the stereoscopic image dataset, where 14 is the number of scenes, 2 is the number of color-map QPs (high
bitrate and low bitrate), and 11 is the number of DT/DH ratios of the coded depth maps.
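The CTU-level Tex/Homo split described above can be illustrated with the following minimal Python sketch; the finite-difference gradient operator, the thresholds grad_thresh and count_thresh, and the NumPy-based implementation are illustrative assumptions and are not the normative homogeneity-detection model of 5.7.

import numpy as np

def classify_ctus(depth, grad_thresh=8.0, count_thresh=256, ctu=64):
    """Label each 64 x 64 CTU of a depth map as Tex (True) or Homo (False).
    A pixel counts as a Tex pixel when its gradient magnitude exceeds
    grad_thresh; a CTU is Tex when it contains more than count_thresh such
    pixels. Both thresholds are illustrative placeholders."""
    gy, gx = np.gradient(depth.astype(np.float64))   # simple gradient estimate
    tex_pixel = np.hypot(gx, gy) > grad_thresh
    rows, cols = depth.shape[0] // ctu, depth.shape[1] // ctu
    labels = np.zeros((rows, cols), dtype=bool)
    for r in range(rows):
        for c in range(cols):
            block = tex_pixel[r * ctu:(r + 1) * ctu, c * ctu:(c + 1) * ctu]
            labels[r, c] = block.sum() > count_thresh
    return labels  # True = Tex CTU, False = Homo CTU

# Example: a synthetic 768 x 1024 depth map with a noisy (textured) band.
rng = np.random.default_rng(0)
depth = np.zeros((768, 1024), dtype=np.uint8)
depth[:, 400:600] = rng.integers(0, 256, size=(768, 200), dtype=np.uint8)
print(int(classify_ctus(depth).sum()), "Tex CTUs detected")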

Table 6—The detailed QP


               Color map QP   Tex CTU QP (QPT)   Homo CTU QP (QPH)
High bitrate   25             34                 24 26 28 30 32 34 36 38 40 42 44
Low bitrate    40             45                 35 37 39 41 43 45 47 48 49 50 51
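For reference, the Homo-CTU QP candidates in Table 6 can be enumerated from the Tex-CTU QP with the ±10 range and step of 2 described in 8.1.2, as in the minimal sketch below; clamping to HEVC's valid QP range of 0 to 51 is an illustrative assumption, and the low-bitrate row of Table 6 instead packs its top values (47, 48, 49, 50, 51) to keep 11 distinct QPs, so the generated list matches Table 6 exactly only for the high-bitrate case.

def homo_qp_candidates(tex_qp, span=10, step=2, qp_min=0, qp_max=51):
    """Enumerate Homo-CTU QP candidates around a Tex-CTU QP, clamped to the
    valid HEVC QP range (an illustrative choice; see the note above)."""
    return [min(max(q, qp_min), qp_max)
            for q in range(tex_qp - span, tex_qp + span + 1, step)]

print(homo_qp_candidates(34))  # 24, 26, ..., 44 -- matches the high-bitrate row
print(homo_qp_candidates(45))  # clamped at 51, unlike the published low-bitrate row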

Table 7—The detailed information for 14 scenes


Resolution     Scene name         Input view no.   Baseline    View no. for stereo pair
1024 × 768     Newspaper          4 2 6            5 cm        3.5, 4.5
               BookArrival        8 10 6           6.5 cm      8.5, 7.5
               Kendo              3 1 5            5 cm        2.5, 3.5
               Balloons           3 1 5            5 cm        2.5, 3.5
               Lovebird1          6 4 8            3.5 cm      5, 7
1280 × 960     Champagne_tower    39 37 41         5 cm        38.5, 39.5
               Pantomime          39 37 41         5 cm        38.5, 39.5
1920 × 1088    GT_Fly             5 9 1            5 cm        4.5, 5.5
               Shark              1 5 9            5 cm        4.5, 5.5
               Microworld         1 5 9            5 cm        4.5, 5.5
               Poznan_Hall1       2 3 1            13.75 cm    2.25, 1.75
               Poznan_Hall2       2 3 1            13.75 cm    2.25, 1.75
               Poznan_CarPark     4 5 3            13.75 cm    4.25, 3.75
               Poznan_Street      4 5 3            13.75 cm    4.25, 3.75

8.2 Subjective experiment

8.2.1 Method

For the scenes in the dataset, subjective tests were performed to get the mean opinion score (MOS) for each
stereoscopic image.


8.2.2 Subjects

Ten females and ten males, aged between 20 and 28, participated in the test. They were all non-expert.
All the setups followed the ITU-R BT.500. Twenty volunteers were scientific research workers, including
18 graduate students and two doctoral students. Two were postgraduates in the field of image processing,
but had no specific knowledge in image-quality assessment. And 18 were other graduate students in other
fields. Visual acuity-related tests were performed on the experimental participants. All volunteers had
normal range of vision, normal stereoscopic perception, and no one was color-blind.

8.2.3 Stimuli

The established stereoscopic image dataset consists of 308 3D images in side-by-side (SBS) format.

8.2.4 Equipment

The stereoscopic 3D images were displayed on an ASUS VG278HE monitor (a 27-inch display with a
resolution of 1920 × 1080) driven by an NVIDIA GTX 960 GPU, and the participants wore a pair of
NVIDIA 3D Vision 2 wireless active shutter glasses.

For the software, the Windows 7 64-bit operating system was used. The NVIDIA GeForce 3D Vision driver
was installed, and the 3D Vision setup wizard was then run from Start menu > NVIDIA > 3D Vision. Once
set up, 3D images can be viewed from Start menu > NVIDIA > 3D Vision > 3D Vision Preview Pack 1. If
the 3D effect of the picture is visible, the 3D display setup has been completed successfully.

The 3D viewing distance for the 27-inch display is 1 meter, so the viewing distance was set to 1 meter. The
refresh frequency of fluorescent lamps is lower than the minimum 120 Hz flicker-frequency requirement of
the 3D Vision 2 glasses, so fluorescent lamps are not allowed in front of the viewer in the experimental
room. The illumination of the experimental room is 190 lux and remains constant during each viewing
session. The background surrounding the monitor is kept as plain as possible to avoid distracting the
viewer’s attention. The room was spacious, quiet, and comfortable for the viewers throughout the sessions.

8.2.5 Procedure

For the subjective evaluation, the absolute category rating method recommended by ITU-R BT.500 was
used. One 3D picture was viewed at a time, and the pictures in the database were played in random order.
The scenes were interleaved so that two consecutive 3D images never came from the same video scene. The
NVIDIA 3D Photo Viewer was used as the photo slideshow tool, playing the pictures in slide mode.
Because volunteers needed visual relaxation between two 3D images, a grey image was shown between
consecutive 3D images. Assessors used a scoring form with clearly labeled scales and numbered boxes.

Each subject was asked to assign a visual discomfort score to each stereo pair using the following scale:
5 = very comfortable, very clear, and very good 3D perception; 4 = comfortable, clear, and good 3D
perception; 3 = mildly comfortable, mildly clear, and mildly good 3D perception; 2 = uncomfortable, unclear,
or bad 3D perception; 1 = extremely uncomfortable, extremely unclear, and extremely bad 3D perception.

Before the subjective experiment, each volunteer was asked to read an introduction document. The
document first explained that the purpose of the experiment was to assess the subjective perception of 3D
images, and then described the assessment principles. The volunteers were also told that the entire test
would take about 30 minutes. A 5-minute training session was used to accustom the volunteers to 3D
viewing. The training session covered additional scenes and different 3D quality levels, helping the
volunteers become familiar with the picture scenes, the assessment criteria, and the assessment process. At
the end of the experiment, the volunteers were asked to provide personal information, such as age and
gender, for statistical purposes.


8.2.6 Results and discussion

As in 8.1.2, the average distortion of the Tex CTUs in a depth map is denoted DT and the average distortion
of the Homo CTUs is denoted DH; for each scene, 11 texture-and-depth pairs corresponding to 11 DT/DH
ratios are generated. For each scene, the DT/DH ratio, total bitrate, depth bitrate, PSNR of the depth maps and
color maps, PSNR of the synthesized views, and the number of Tex and Homo CTUs in the depth maps are
recorded. Taking the scene “Champagne”, whose color maps are coded at QP 25 and QP 40, as an example,
Figure 23(a) depicts the variation of DT/DH with the difference between the QP of the Homo CTUs and that
of the Tex CTUs (denoted Homo QP − Tex QP). Figure 23(b) depicts the variation of the synthesized-view
PSNR with (Homo QP − Tex QP), and Figure 23(c) depicts the variation of the MOS of the synthesized 3D
picture with (Homo QP − Tex QP). As shown in the figure, both the DT/DH of the depth map and the
synthesized-view PSNR are affected by the depth-distortion variation, but DT/DH changes substantially
while the synthesized-view PSNR shows only a slight difference. Moreover, the PSNR decreases with
increasing (Homo QP − Tex QP), but the MOS shows no corresponding decrease. This indicates that humans
do not notice some of the distortion in the Homo regions, whereas the difference between the high-bitrate
and low-bitrate cases is obvious.
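To make the DT/DH ratio concrete, the following minimal sketch computes it as the ratio of the mean squared error over the Tex CTUs to that over the Homo CTUs of a coded depth map; taking mean squared error as the "average distortion" and the function name dt_dh_ratio() are illustrative assumptions, since this clause does not fix a particular distortion measure, and the CTU labels could be obtained, for example, with the classification sketch given in 8.1.2.

import numpy as np

def dt_dh_ratio(original_depth, coded_depth, tex_labels, ctu=64):
    """DT/DH with DT (DH) taken as the mean squared error of the Tex (Homo)
    CTUs; tex_labels is a boolean array with one entry per CTU."""
    err = (original_depth.astype(np.float64) - coded_depth.astype(np.float64)) ** 2
    tex_mse, homo_mse = [], []
    rows, cols = tex_labels.shape
    for r in range(rows):
        for c in range(cols):
            block = err[r * ctu:(r + 1) * ctu, c * ctu:(c + 1) * ctu]
            (tex_mse if tex_labels[r, c] else homo_mse).append(block.mean())
    dt = float(np.mean(tex_mse)) if tex_mse else 0.0
    dh = float(np.mean(homo_mse)) if homo_mse else 0.0
    return dt / dh if dh > 0 else float("inf")

# Example with synthetic data: stronger quantization error in the Homo CTUs.
rng = np.random.default_rng(1)
orig = rng.integers(0, 256, size=(128, 128), dtype=np.uint8)
coded = orig.copy()
coded[0:64, :] = (coded[0:64, :] // 4) * 4        # mild error in the Tex row of CTUs
coded[64:128, :] = (coded[64:128, :] // 16) * 16  # stronger error in the Homo row
labels = np.array([[True, True], [False, False]]) # top CTUs Tex, bottom CTUs Homo
print(round(dt_dh_ratio(orig, coded, labels), 3)) # < 1: Tex CTUs less distorted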



Figure 23 —Scene “Champagne”: (a) depth-map DT/DH versus (Homo QP − Tex QP),
(b) synthesized-view PSNR versus (Homo QP − Tex QP),
(c) synthesized 3D picture MOS versus (Homo QP − Tex QP)


Annex A

(informative)

Bibliography

Bibliographical references are resources that provide additional or helpful material but do not need to be
understood or used to implement this standard. Reference to these resources is made for informational use
only.

[B1] Appuhami, H. D., M. G. Martini, and C. T. E. R. Hewage, “Using 3D Structural Tensors in Quality
Evaluation of Stereoscopic Video,” IEEE Visual Communications and Image Processing (IEEE VCIP
2014) Conference, Valletta, Malta, 7–10 Dec. 2014.
[B2] Banks, M. S., S. Gepshtein, and M. S. Landy, “Why Is Spatial Stereoresolution So Low?” Journal of
Neuroscience, vol. 24, no. 9, pp. 2077–2089, Mar. 2004.
[B3] Battisti, F., E. Bosc, M. Carli, P. Le Callet, and S. Perugia, “Objective Image Quality Assessment of
3D Synthesized Views,” Signal Processing: Image Communication, doi:10.1016/j.image.2014.10.005.
[B4] Beverley, B. I., and D. Regan, “Visual Perception of Changing Size: The Effect of Object Size,”
Vision Research, vol. 19, no. 10, pp. 1093–1104, 1979.
[B5] Blake, R., “A Primer on Binocular Rivalry, Including Current Controversies,” Brain and Mind, vol.
2, pp. 5–38, 2001.
[B6] Blake, R., D. H. Westendorf, and R. Overton, “What Is Suppressed During Binocular Rivalry?”
Perception, vol. 9, no. 2, pp. 223–231, 1980.
[B7] Bosc, E., M. Koppel, R. Pepion, M. Pressigout, L. Morin, P. Ndjiki-Nya, and P. Le Callet, “Can 3D
Synthesized Views Be Reliably Assessed Through Usual Subjective and Objective Evaluation Protocols?”
International Conference on Image Processing, Brussels, 2011.
[B8] Cormack, L. K., D. D. Lander, and S. Ramakrishnan, “Element Density and the Efficiency of
Binocular Matching,” Journal of the Optical Society of America, vol. 14, no. 4, pp. 723–730, Apr. 1997.
[B9] Haber, M., and H. Barnhart, “Coefficients of Agreement for Fixed Observers,” Statistical Methods
in Medical Research, vol. 15, no. 3, p. 255, 2006.
[B10] Hewage, C. T. E. R., H. Appuhami, M. G. Martini, R. Smith, T. Rockall, and I. Jourdan, “Quality
Evaluation of Compressed 3D Surgical Video,” SSH: IEEE International Workshop on Service Science for
eHealth, Natal, Brazil, 15–18 Oct. 2014.
[B11] Hewage, C. T. E. R., and M. G. Martini, “Edge Based Reduced-Reference Quality Metric for 3D
Video Compression and Transmission,” IEEE Journal of Selected Topics in Signal Processing, vol. 6, no. 5,
pp. 471–482, 2012.
[B12] Hewage, C. T. E. R., and M. G. Martini, “Quality of Experience for 3D Video Streaming,” IEEE
Communications Magazine, vol. 51, no. 5, 2013.
[B13] Hewage, C. T. E. R., and M. G. Martini, “Reduced-Reference Quality Assessment for 3D Video
Compression and Transmission,” IEEE Transactions on Consumer Electronics, vol. 57, no. 3, pp. 1185–
1193, 2011.

5 The IEEE standards or products referred to in Annex A are trademarks owned by the Institute of Electrical and Electronics Engineers, Incorporated.
6 IEEE publications are available from the Institute of Electrical and Electronics Engineers (http://standards.ieee.org/).

[B14] Hewage, C. T. E. R., M. G. Martini, H. D. Appuhami, and C. Politis, “Real-Time 3D QoE
Evaluation of Novel 3D Media,” in A. Kondoz and T. Dagiuklas (eds.), Novel 3D Media Technologies.
New York: Springer, pp. 163–184, 2015.
[B15] Hewage, C. T. E. R., S. T. Worrall, S. Dogan, and A. M. Kondoz, “Quality Evaluation of Color Plus
Depth Map-Based Stereoscopic Video,” IEEE Journal of Selected Topics in Signal Processing, vol. 3, no. 2,
pp. 304–318, 2009.
[B16] IRCCyN/IVC DIBR Images, “Subjectives Scores for DIBR Algorithms Using ACR-HR and Pair
Comparisons” [online database], available: http://www.irccyn.ec-nantes.fr/spip.php?article865.
[B17] Kim, H., S. Ahn, W. Kim, and S. Lee, “Visual Preference Assessment on Ultra-High-Definition
Images,” IEEE Transactions on Broadcasting, vol. 62, no. 4, pp. 757–769, Dec. 2016.
[B18] Kim, H., S. Lee, and A. C. Bovik, “Saliency Prediction on Stereoscopic Videos,” IEEE Transactions
on Image Processing, vol. 23, no. 4, pp. 1476–1490, Apr. 2014.
[B19] Lee, S., and A. C. Bovik, “Fast Algorithms for Foveated Video Processing,” IEEE Transactions on
Circuits and Systems for Video Technology, vol. 13, no. 2, pp. 149–162, Feb. 2003.
[B20] Lee, K., and S. Lee, “3D Perception Based Quality Pooling: Stereopsis, Binocular Rivalry, and
Binocular Suppression,” IEEE Journal of Selected Topics in Signal Processing, vol. 9, no. 3, pp. 533–545,
Jan. 2015.
[B21] Lee, K., A. K. Moorthy, S. Lee, and A. C. Bovik, “3D Visual Activity Assessment Based on Natural
Scene Statistics,” IEEE Transactions on Image Processing, vol. 23, no. 1, pp. 450–465, Jan. 2014.
[B22] Lee, S., M. S. Pattichis, and A. C. Bovik, “Foveated Video Compression with Optimal Rate
Control,” IEEE Transactions on Image Processing, vol. 10, no. 7, pp. 977–992, Jul. 2001.
[B23] Lee, S., M. S. Pattichis, and A. C. Bovik, “Foveated Video Quality Assessment,” IEEE Transactions
on Multimedia, vol. 4, no. 1, pp. 129–132, Mar. 2002.
[B24] Levelt, W. J. M., On Binocular Rivalry, The Hague; Paris: Mouton, 1968.
[B25] Li, F., L. Shen, D. Wu, and R. Fang, “Full-Reference Quality Assessment of Stereoscopic Images
Using Disparity-Gradient-Phase Similarity,” 2015 IEEE China Summit and International Conference on
Signal and Information Processing (ChinaSIP), pp. 658–662, 12–15 July 2015.
[B26] Malpica, W., and A. Bovik, “SSIM Based Range Image Quality Assessment,” Fourth International
Workshop on Video Processing and Quality Metrics for Consumer Electronics, Scottsdale, AZ, 15–16 Jan.
2009.
[B27] Meenes, M., “A Phenomenological Description of Retinal Rivalry,” American Journal of
Psychology, vol. 42, pp. 260–269, 1930.
[B28] Moorthy, A. K., and A. C. Bovik, “Visual Importance Pooling for Image Quality Assessment,” IEEE
Journal of Selected Topics in Signal Processing, vol. 3, no. 2, pp. 193–201, Apr. 2009.
[B29] Nasralla, M. M., C. T. E. R. Hewage, and M. G. Martini, “Subjective and Objective Evaluation and
Packet Loss Modeling for 3D Video Transmission over LTE Networks,” International Conference on
Telecommunications and Multimedia (TEMU), Heraklion, Greece, 28–30 July 2014.
[B30] Neri, P., “A Stereoscopic Look at Visual Cortex,” Journal of Neurophysiology, vol. 93, pp. 1823–
1826, 2005.
[B31] Oliveira, V., and A. Conci, “Skin Detection Using HSV Color Space,” Proceedings of SIBGRAPI
2009, XXII Brazilian Symposium on Computer Graphics and Image Processing, Rio de Janeiro, Brazil, 11–
15 Oct. 2009.
[B32] Park, J., K. Seshadrinathan, S. Lee, and A. C. Bovik, “VQPooling: Video Quality Pooling Adaptive
to Perceptual Distortion Severity,” IEEE Transactions on Image Processing, vol. 22, no. 4, Feb. 2013.


[B33] Recommendation ITU-R BT.1438: Subjective Assessment of Stereoscopic Television Pictures.


[B34] Solh, M., G. AlRegib, and J. Bauza, “3VQM: A Vision-Based Quality Measure for DIBR-Based 3D
Videos,” 2011 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6, Barcelona, Spain,
2011.
[B35] Walther, D., U. Rutishauser, C. Koch, and P. Perona, “On the Usefulness of Attention for Object
Recognition,” Workshop on Attention and Performance in Computational Vision at ECCV, pp. 96–103,
Prague, Czech Republic, 15 May 2004.

7 ITU-R publications are available from the International Telecommunication Union (http://www.itu.int/).
