
PESQ Overview


PESQ is the new ITU-T standard for measuring the voice quality of communications networks.

This overview explains the motivation behind voice quality measurement, describes the development of PESQ, and gives an overview of the components that make up the model. In addition it describes some applications of PESQ and gives some typical PESQ scores for a range of common network conditions.

Voice Quality

Motivation

End-to-end speech quality is the key measure of voice Quality of Service (QoS). Assessment is essential for equipment selection, monitoring, fault-finding, service level agreements and optimization of networks. Getting quality right can make a major difference both to customer satisfaction and to the cost of providing a service.

Quality in networks will remain an issue as long as bandwidth and processing power are limited. This applies across networks of all types. In mobile networks, bandwidth to the customer is expensive. Quality measurement means that the network can be engineered to deliver the right quality at the right cost. In Voice over IP (VoIP, Internet telephony), performance is also an issue and operators tend to over-provision. Using the right tools to monitor quality can stop over-provisioning and allow networks to service more customers and therefore make more money.

Factors that affect quality include:

  • Low bit-rate coding
  • Errors (mobile or packet)
  • Background noise
  • Silence suppression
  • Filtering by handsets or the access network

Measuring Voice Quality

The traditional method of determining voice quality is to conduct subjective tests with panels of human listeners. Extensive guidelines are given in ITU-T recommendations P.800/P.830. The results of these tests are averaged to give mean opinion scores (MOS). However, such tests are expensive and impractical for testing in the field.
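As a minimal sketch of the averaging step, the following computes a MOS from a panel's votes for one test condition. The function name, the example votes and the normal-approximation confidence interval are illustrative assumptions, not part of P.800.

```python
from math import sqrt
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Return (MOS, 95% confidence half-width) for ratings on the 1-5 opinion scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 opinion scale")
    mos = mean(ratings)
    # Normal-approximation interval; assumes the listeners vote independently.
    if len(ratings) > 1:
        half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    else:
        half_width = 0.0
    return mos, half_width


# Hypothetical votes from eight listeners for a single condition.
votes = [4, 3, 4, 5, 3, 4, 4, 2]
mos, ci = mean_opinion_score(votes)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

The confidence interval shrinks only with the square root of the panel size, which is one reason subjective tests are costly to run at useful precision.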

For this reason the ITU recently standardized a new model, PESQ (ITU-T recommendation P.862), that automatically predicts the quality scores that would be given in a typical subjective test. This is done by making an intrusive test, as shown in Figure 1, and processing the test signals through PESQ.


Figure 1: Use of PESQ

Development of PESQ

Perceptual models for quality assessment

Modeling perception – specifically human auditory perception – is the core concept behind PESQ and its predecessors. This concept dates back to the late 1970s, when Manfred Schroeder introduced it for speech coding.

Signal compression algorithms, used in modern speech and audio codecs, use perceptual information to decide which parts of a signal to code and which to discard. For example, the MPEG audio codecs use a model of “perceptual masking” to decide how many bits to use for coding each frequency, and which frequencies need not be coded at all.

Simple measures like SNR do not accurately reflect the quality of these systems: perceptually masked coding noise, at a typical SNR of 13 dB, can be completely inaudible, whereas random noise at the same SNR would be extremely disturbing.
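To make the limitation concrete, here is a minimal sketch of a waveform SNR calculation. It only compares total signal power with total error power, so it cannot tell whether the error falls where the ear would mask it. The function name and sample values are illustrative.

```python
import math


def snr_db(reference, degraded):
    """Waveform signal-to-noise ratio in dB between two equal-length sample lists."""
    if len(reference) != len(degraded):
        raise ValueError("signals must have the same length")
    signal_power = sum(x * x for x in reference)
    # The "noise" is simply the sample-by-sample difference.
    noise_power = sum((y - x) ** 2 for x, y in zip(reference, degraded))
    if noise_power == 0:
        return math.inf
    return 10 * math.log10(signal_power / noise_power)


# Hypothetical samples: every sample is off by 0.1.
reference = [1.0, -1.0, 1.0, -1.0]
degraded = [1.1, -0.9, 1.1, -0.9]
print(f"SNR = {snr_db(reference, degraded):.1f} dB")
```

Two degradations with identical error power give the same figure here, regardless of how the error is distributed in time or frequency.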

Matti Karjalainen first reported the use of a perceptual model for quality assessment in 1985. A perceptual model is used to correctly distinguish between audible and inaudible distortions and this has proven to be the best way of accurately predicting the audibility and annoyance of complex distortions.

Mike Hollier at BT Labs and John Beerends of KPN Research led subsequent innovations in the 1990s on the use of perception for voice quality assessment. Hollier observed that taking account not just of the amount, but also the distribution, of audible distortion could make quality predictions much more accurate.

It was not until 1996, following a lengthy international study, that perceptual models for quality assessment were first standardized. The result of this was that Beerends' model, PSQM, became an ITU-T recommendation (P.861) for assessing speech codecs.

Problems with PSQM

It soon became clear that PSQM was not suitable for testing networks, where speech codecs are only one part of a complex chain. PSQM was found to correlate very poorly with subjective opinion in some commonly occurring situations:

  • speech clipping
  • background noise
  • packet loss in VoIP networks
  • filtering in analogue elements (such as handsets or 2-wire access loops)
  • variable delay (common in VoIP).

The extent of PSQM's problems was illustrated well by one subjective test, which contained a range of network conditions including filtering and VoIP. The correlation achieved by PSQM against subjective MOS was only 0.26, whereas an ideal model would have a correlation of 1. PESQ, for the same test, has a correlation of 0.93.
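Correlation figures of this kind compare, condition by condition, the model's score with the subjective MOS. A minimal sketch of the Pearson correlation coefficient follows. The data values are hypothetical, and the sketch assumes plain linear correlation with no mapping applied to the model scores first.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / (spread_x * spread_y)


# Hypothetical per-condition results: subjective MOS vs. objective model score.
subjective_mos = [4.2, 3.8, 3.1, 2.5, 1.9]
model_score = [4.0, 3.9, 3.0, 2.7, 1.8]
print(f"correlation = {pearson(subjective_mos, model_score):.2f}")
```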

Standardization of PESQ as P.862

A number of organizations pressed the ITU-T to select a replacement for PSQM that would be more suitable for testing networks. To this end ITU-T study group 12 held a competition from September 1998 to March 2000. The following companies took part: KPN (with PSQM99, an extended and improved version of PSQM), BT (with PAMS), Ascom, Ericsson and Deutsche Telekom.

The outcome of the competition was a clear division of the models into two groups. PSQM99 and PAMS formed the leading group, but statistically there was no single winner. PSQM99 performed better on certain conditions of rapid gain variations and severe temporal clipping, whereas PAMS performed better on conditions of VoIP and filtering.

The second group all had significantly lower average correlation and showed shortcomings on many more of the condition types. PSQM, PSQM+ and MNB had poorer performance still.

It was therefore decided to integrate the best two models, PSQM99 and PAMS, to produce a single best-of-breed model. The ITU-T decided that, to be accepted, this model would need to outperform both PSQM99 and PAMS by passing even more demanding performance tests. The KPN group consequently collaborated with the BT group to achieve this. The result was PESQ.

In May 2000 PESQ passed all of the new performance criteria and was submitted for standardization as P.862. This process was completed in February 2001 with the final approval of P.862 and the withdrawal of P.861.

How PESQ works

Algorithm overview

PESQ measures one-way voice quality: a signal is injected into the system under test, and the degraded output is compared by PESQ with the input (reference) signal.

The test signals must be speech-like because many systems are optimized for speech and respond in an unrepresentative way to non-speech signals (e.g. tones, noise, ITU-T P.50).


Figure 2: Structure of PESQ

The processing carried out by PESQ is illustrated in Figure 2. The model includes the following stages.

Level alignment. In order to compare the signals, the reference speech signal and the degraded signal are aligned to the same constant power level. This corresponds to the normal listening level used in subjective tests.
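The idea of level alignment can be sketched as scaling each signal to a common RMS level. This is a simplified illustration only: it uses plain RMS over the whole signal, and the target level, function name and full-scale convention are assumptions rather than the values PESQ uses.

```python
import math


def align_level(samples, target_db=-26.0, full_scale=1.0):
    """Scale samples so their RMS level is target_db dB relative to full_scale."""
    if not samples:
        return []
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return list(samples)  # silence cannot be scaled to a target level
    target_rms = full_scale * 10 ** (target_db / 20)
    gain = target_rms / rms
    return [s * gain for s in samples]


# Hypothetical signals at different levels end up at the same power.
reference = align_level([0.5, -0.5, 0.5, -0.5])
degraded = align_level([0.05, -0.05, 0.05, -0.05])
```

Applying the same function to both the reference and the degraded signal removes overall gain differences, so that later stages compare distortion rather than loudness.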

Input filtering. PESQ models and compensates for filtering that takes place in the telephone handset and in the network.

Time alignment. The system may include a delay, which may change several times during a test; for example, Voice over IP often has variable delay. PESQ can identify and account for delay changes.
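A common way to estimate a delay is to find the lag that maximizes the cross-correlation between the two signals. The brute-force sketch below estimates a single constant delay. It is not the PESQ alignment procedure, and the function name and test signal are illustrative. To follow a delay that changes during a call, one would apply such an estimate separately to shorter segments of the signal.

```python
def estimate_delay(reference, degraded, max_lag):
    """Return the lag in samples (within +/- max_lag) that maximizes cross-correlation.

    A positive result means the degraded signal lags behind the reference.
    """
    best_lag = 0
    best_score = float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, ref_sample in enumerate(reference):
            j = i + lag
            if 0 <= j < len(degraded):
                score += ref_sample * degraded[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Hypothetical signal delayed by three samples.
reference = [0.0, 1.0, 3.0, -2.0, 0.5, 0.0, 0.0, 0.0]
degraded = [0.0, 0.0, 0.0] + reference[:5]
print(estimate_delay(reference, degraded, max_lag=5))
```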

Auditory transform. The reference and degraded signals are passed through an auditory transform that mimics key properties of human hearing. This transform removes those parts of the signal that are inaudible to the listener.

Disturbance processing. Disturbance parameters are calculated using non-linear averages over specific areas of the error surface:

  • the absolute (symmetric) disturbance: a measure of absolute audible error
  • the additive (asymmetric) disturbance: a measure of audible errors that are much louder than the reference
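One common form of non-linear average is the Lp norm, in which a higher exponent gives more weight to short, severe disturbances than a plain mean does. The sketch below illustrates the idea only; the exponents and disturbance values are illustrative assumptions, not those specified for PESQ.

```python
def lp_average(values, p):
    """Lp-norm average: ((1/N) * sum(|v|^p)) ** (1/p)."""
    if not values:
        raise ValueError("at least one value is required")
    if p <= 0:
        raise ValueError("p must be positive")
    return (sum(abs(v) ** p for v in values) / len(values)) ** (1.0 / p)


# Hypothetical per-frame disturbances: mostly small, with one severe frame.
frame_disturbance = [0.1, 0.1, 0.1, 2.0]
plain_mean = lp_average(frame_disturbance, p=1)
peak_weighted = lp_average(frame_disturbance, p=6)
print(f"p=1: {plain_mean:.3f}, p=6: {peak_weighted:.3f}")
```

With p = 1 the single bad frame is diluted by the good ones; with a larger p it dominates the result, which better matches how a listener judges a call with one badly damaged moment.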

PESQ outputs

These disturbance parameters are converted to a PESQ score, which ranges from –0.5 to 4.5. A function is also available to convert this to PESQ-LQ, which gives a P.800 MOS-like listening quality score between 1 and 5 (Table 1).
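The conversion is a linear combination of the two disturbance parameters. The sketch below uses the coefficients commonly quoted for P.862; they are stated here as an assumption and should be checked against the Recommendation before being relied upon. The function and argument names are illustrative.

```python
def pesq_score(symmetric_disturbance, asymmetric_disturbance):
    """Combine the two averaged disturbance values into a raw PESQ score.

    Coefficients are those commonly quoted for P.862 (an assumption here;
    verify against the Recommendation).
    """
    return 4.5 - 0.1 * symmetric_disturbance - 0.0309 * asymmetric_disturbance


# No audible disturbance gives the top of the scale.
print(pesq_score(0.0, 0.0))
# Hypothetical disturbance values for a degraded call.
print(round(pesq_score(8.0, 12.0), 2))
```

Both disturbance terms only ever lower the score, so a larger audible error always means a lower predicted quality.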

Score   Quality of the speech
5       Excellent
4       Good
3       Fair
2       Poor
1       Bad

Table 1: Listening quality scale

 

Applications of PESQ

PESQ can be used in a wide range of measurement applications. Being fast and repeatable, PESQ makes it possible to perform extensive testing over a short period and also enables the quality of time-varying conditions to be monitored.

Codec development. The impact of changes to a coding algorithm can be quickly investigated using the objective model, even if their effect is small. The model can also be used to explore how quality varies with bit rate, input level or channel errors.

Equipment selection. Codecs or other communications systems can be compared using PESQ. For example, PESQ has been successfully used to compare technologies and distortion scenarios for mobile networks, VoIP, and speech codecs.

Equipment optimization. It can be very difficult for a user to find the “correct” values given a choice of coder, input level, bit rate or buffer length. Using an objective model allows the optimum settings to be found quickly, and can resolve much smaller differences than could be measured in a conventional subjective test.

Monitoring. With a network of test devices to make regular measurement calls, PESQ can be used to benchmark the call quality of communications networks. As well as tracking quality over time or in varying conditions, the model can even help to identify problems before the customers notice.

 

PESQ scores for typical network conditions

Based on simulations and real measurements, Table 6 presents typical scores for a number of networks and codecs with no errors or packet loss. It also gives the scores that can be expected in some mobile network conditions where errors are significant.

Please note that results can be affected by a number of factors, for example the test signal used. We averaged the scores from measurements with different speech material in four languages. Each measurement was 8 seconds long and used clean speech. The speech signals at the input to the network were MIRS send filtered and were at an active speech level of –26 dBov.

Network condition                                   Typical PESQ score   Typical PESQ-LQ score
Clean ISDN network                                  4.3                  4.4
Analog network (G.711)                              4.1                  4.2
G.728 codec (16 kbit/s)                             3.8                  3.9
G.729 codec (8 kbit/s)                              3.6                  3.7
G.723.1 codec (6.3 kbit/s)                          3.5                  3.4
GSM EFR codec (12.2 kbit/s)                         3.9                  4.0
GSM FR codec (13 kbit/s)                            3.5                  3.5
GSM-EFR mobile network in typical operating range   3.6 to 3.1           3.6 to 2.9
GSM-EFR mobile network in very poor conditions      2.2                  1.6

 

Table 6: Typical PESQ scores for a range of network conditions