Speech processing is the study of speech signals and the processing methods of these signals. The signals are usually processed in a digital representation, so speech processing can be regarded as a special case of digital signal processing, applied to speech signals. Aspects of speech processing include the acquisition, manipulation, storage, transfer and output of speech signals. Different speech processing tasks include speech recognition, speech synthesis, speaker diarization, speech enhancement, speaker recognition, etc. [1]
History
Early attempts at speech processing and recognition were focused primarily on understanding a handful of simple phonetic elements such as vowels. Pioneering work in the field of speech recognition using analysis of the speech spectrum was reported in the 1940s. [3] In 1952, three researchers at Bell Labs, Stephen Balashek, R. Biddulph, and K. H. Davis, developed a system that could recognize digits spoken by a single speaker. [2]
Linear predictive coding (LPC), a speech processing algorithm, was first proposed by Fumitada Itakura of Nagoya University and Shuzo Saito of Nippon Telegraph and Telephone (NTT) in 1966. [4] Further developments in LPC technology were made by Bishnu S. Atal and Manfred R. Schroeder at Bell Labs during the 1970s. [4] LPC was the basis for voice-over-IP (VoIP) technology, [4] as well as speech synthesizer chips, such as the Texas Instruments LPC Speech Chips used in the Speak & Spell toys from 1978. [5]
One of the first commercially available speech recognition products was Dragon Dictate, released in 1990. In 1992, technology developed by Lawrence Rabiner and others at Bell Labs was used by AT&T in their Voice Recognition Call Processing service to route calls without a human operator. By this point, the vocabulary of these systems was larger than the average human vocabulary. [6]
By the early 2000s, the dominant speech processing strategy started to shift away from hidden Markov models towards more modern neural networks and deep learning. [citation needed]
Techniques
Dynamic time warping
Dynamic time warping (DTW) is an algorithm for measuring similarity between two temporal sequences, which may vary in speed. In general, DTW is a method that calculates an optimal match between two given sequences (e.g. time series) under certain restrictions and rules. The optimal match is the match that satisfies all the restrictions and rules and that has the minimal cost, where the cost is computed as the sum of absolute differences, for each matched pair of indices, between their values. [citation needed]
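The cost computation described above can be sketched as a small dynamic program. This is a minimal illustration for scalar sequences; real speech systems typically add window constraints and compare feature vectors (such as MFCCs) rather than single numbers:

```python
def dtw_distance(a, b):
    """Minimal-cost alignment between two sequences.

    Cost of matching a pair of indices is the absolute difference
    of their values; the total cost is the sum over all matched pairs.
    """
    n, m = len(a), len(b)
    INF = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # match both
                                 cost[i - 1][j],      # advance in a only
                                 cost[i][j - 1])      # advance in b only
    return cost[n][m]

# A sequence and a time-stretched copy of it align with zero cost,
# which is exactly the "may vary in speed" property:
dist = dtw_distance([0, 1, 2, 3], [0, 1, 1, 2, 2, 3])  # 0.0
```

Because indices may be matched many-to-one, a slowed-down utterance still aligns perfectly with the original, unlike a plain element-by-element distance.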
Hidden Markov models
A hidden Markov model can be represented as the simplest dynamic Bayesian network. The goal of the algorithm is to estimate a hidden variable x(t) given a list of observations y(t). By applying the Markov property, the conditional probability distribution of the hidden variable x(t) at time t, given the values of the hidden variable x at all times, depends only on the value of the hidden variable x(t − 1). Similarly, the value of the observed variable y(t) depends only on the value of the hidden variable x(t) (both at time t). [citation needed]
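Viterbi decoding recovers the most likely hidden-state path under exactly these two conditional-independence assumptions. A minimal sketch follows; the toy state and observation labels are hypothetical, not drawn from any real recognizer:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden-state sequence x(1..T) for observations y(1..T)."""
    # best[t][s] = probability of the best path ending in state s at time t
    best = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            # Markov property: only the previous hidden state matters
            prob, prev = max(
                (best[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states
            )
            best[t][s] = prob
            back[t][s] = prev
    # backtrack from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Toy two-state model (illustrative numbers only):
states = ("vowel", "consonant")
start_p = {"vowel": 0.5, "consonant": 0.5}
trans_p = {"vowel": {"vowel": 0.3, "consonant": 0.7},
           "consonant": {"vowel": 0.7, "consonant": 0.3}}
emit_p = {"vowel": {"loud": 0.9, "quiet": 0.1},
          "consonant": {"loud": 0.2, "quiet": 0.8}}
path = viterbi(["loud", "quiet", "loud"], states, start_p, trans_p, emit_p)
```

In a speech recognizer the hidden states would be sub-phonetic units and the observations acoustic feature vectors, but the decoding recursion is the same.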
Artificial neural networks
An artificial neural network (ANN) is based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit a signal from one artificial neuron to another. An artificial neuron that receives a signal can process it and then signal additional artificial neurons connected to it. In common ANN implementations, the signal at a connection between artificial neurons is a real number, and the output of each artificial neuron is computed by some non-linear function of the sum of its inputs. [citation needed]
Phase-aware processing
Phase is usually assumed to be a uniformly distributed random variable, and thus useless. This is due to the wrapping of phase: [7] the result of the arctangent function is not continuous, because of periodic jumps of 2π. After phase unwrapping (see, [8] Chapter 2.3; Instantaneous phase and frequency), the instantaneous phase can be decomposed as [7] [9]

φ(t) = φ_lin(t) + φ_d(t),

where φ_lin(t) is the linear phase (the temporal shift at each frame of analysis) and φ_d(t) is the phase contribution of the vocal tract and the phase source. [9] Obtained phase estimates can be used for noise reduction: temporal smoothing of the instantaneous phase [10] and of its derivatives with respect to time (instantaneous frequency) and frequency (group delay), [11] and smoothing of phase across frequency. [11] Joint amplitude and phase estimators can recover speech more accurately, based on the assumption of a von Mises distribution of phase. [9]
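The unwrapping step described above can be sketched as follows (a minimal illustration, equivalent in spirit to `numpy.unwrap`: whenever consecutive samples differ by more than π, a multiple of 2π is added so the phase track becomes continuous):

```python
import math

def unwrap(phases):
    """Remove the 2π jumps introduced by the arctangent.

    Each wrapped value lies in (-π, π]; the unwrapped track adds back
    the multiple of 2π that makes consecutive differences small.
    """
    out = [phases[0]]
    for p in phases[1:]:
        d = p - out[-1]
        # shift the difference into (-π, π], then accumulate
        d -= 2 * math.pi * round(d / (2 * math.pi))
        out.append(out[-1] + d)
    return out

# A steadily increasing phase, wrapped by atan2, is recovered exactly:
wrapped = [math.atan2(math.sin(0.7 * i), math.cos(0.7 * i)) for i in range(12)]
unwrapped = unwrap(wrapped)  # ≈ [0.0, 0.7, 1.4, ...]
```

This recovery is what makes the smoothing operations above meaningful: derivatives of the wrapped phase would be dominated by the artificial 2π discontinuities.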
Applications
See also
References
- ^ Sahidullah, Md; Patino, Jose; Cornell, Samuele; Yin, Ruiking; Sivasankaran, Sunit; Bredin, Herve; Korshunov, Pavel; Brutti, Alessio; Serizel, Romain; Vincent, Emmanuel; Evans, Nicholas; Marcel, Sebastien; Squartini, Stefano; Barras, Claude (2019-11-06). "The Speed Submission to DIHARD II: Contributions & Lessons Learned". arXiv:1911.02388 [eess.AS].
- ^ Juang, B.-H.; Rabiner, L.R. (2006), "Speech Recognition, Automatic: History", Encyclopedia of Language & Linguistics, Elsevier, pp. 806–819, doi:10.1016/b0-08-044854-2/00906-8, ISBN 9780080448541
- ^ Myasnikov, L. L.; Myasnikova, Ye. N. (1970). Automatic recognition of sound pattern (in Russian). Leningrad: Energiya.
- ^ a b c Gray, Robert M. (2010). "A History of Realtime Digital Speech on Packet Networks: Part II of Linear Predictive Coding and the Internet Protocol" (PDF). Found. Trends Signal Process. 3 (4): 203–303. doi:10.1561/2000000036. ISSN 1932-8346.
- ^ "VC&G - VC&G Interview: 30 Years Later, Richard Wiggins Talks Speak & Spell Development".
- ^ Huang, Xuedong; Baker, James; Reddy, Raj (2014-01-01). "A historical perspective of speech recognition". Communications of the ACM. 57 (1): 94–103. doi:10.1145/2500887. ISSN 0001-0782. S2CID 6175701.
- ^ a b Mowlaee, Pejman; Kulmer, Josef (August 2015). "Phase Estimation in Single-Channel Speech Enhancement: Limits-Potential". IEEE/ACM Transactions on Audio, Speech, and Language Processing. 23 (8): 1283–1294. doi:10.1109/TASLP.2015.2430820. ISSN 2329-9290. S2CID 13058142. Retrieved 2017-12-03.
- ^ Mowlaee, Pejman; Kulmer, Josef; Stahl, Johannes; Mayer, Florian (2017). Single channel phase-aware signal processing in speech communication: theory and practice. Chichester: Wiley. ISBN 978-1-119-23882-9.
- ^ a b c Kulmer, Josef; Mowlaee, Pejman (April 2015). "Harmonic phase estimation in single-channel speech enhancement using von Mises distribution and prior SNR". Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE. pp. 5063–5067.
- ^ Kulmer, Josef; Mowlaee, Pejman (May 2015). "Phase Estimation in Single Channel Speech Enhancement Using Phase Decomposition". IEEE Signal Processing Letters. 22 (5): 598–602. Bibcode:2015ISPL...22..598K. doi:10.1109/LSP.2014.2365040. ISSN 1070-9908. S2CID 15503015. Retrieved 2017-12-03.
- ^ a b Mowlaee, Pejman; Saeidi, Rahim; Stylianou, Yannis (July 2016). "Advances in phase-aware signal processing in speech communication". Speech Communication. 81: 1–29. doi:10.1016/j.specom.2016.04.002. ISSN 0167-6393. S2CID 17409161. Retrieved 2017-12-03.