1.3 APPLICATIONS OF DIGITAL SPEECH PROCESSING
The first step in most applications of digital speech processing is to convert the acoustic waveform to a sequence of numbers. This discrete-time representation is the starting point for most applications. From this point, more powerful representations are obtained by digital processing. For the most part, these alternative representations are based on combining DSP operations with knowledge about the workings of the speech chain as depicted in Figure 1.4. As we will see, it is possible to incorporate aspects of both the speech production and speech perception processes into the digital representation and processing. As our discussion unfolds, it will become clear that it is not an oversimplification to assert that digital speech processing is grounded in a set of techniques that have the goal of pushing the data rate of the speech representation to the left (i.e., lowering the data rate) along either the upper or lower path in Figure 1.4.
The remainder of this chapter is devoted to a brief summary of the applications of digital speech processing; i.e., the systems that people interact with daily. Our discussion will emphasize the importance of the chosen digital representation in all application areas.
1.3.1 Speech Coding
Perhaps the most widespread applications of digital speech processing technology occur in the areas of digital transmission and storage of speech signals. In these areas the centrality of the digital representation is obvious, since the goal is to compress the digital waveform representation of speech into a lower bit-rate representation. It is common to refer to this activity as "speech coding" or "speech compression." Our discussion of the information content of speech in Section 1.1 suggests that there is much room for compression. In general, this compression is achieved by combining DSP techniques with fundamental knowledge of the speech production and perception processes.
Figure 1.6 shows a block diagram of a generic speech encoding/decoding (or compression) system. In the upper part of the figure, the A-to-D converter converts the analog speech signal x(t) to a sampled waveform representation x[n]. The digital signal x[n] is analyzed and coded by digital computation algorithms to produce a new digital signal y[n] that can be transmitted over a digital communication channel or stored in a digital storage medium as y[n]. As we will see, particularly in Chapter 11 for speech and Chapter 12 for audio signals, there are a myriad of ways to do the encoding so as to reduce the data rate over that of the sampled and quantized waveform x[n]. Because the digital representation at this point is often not directly related to the sampled speech waveform, y[n] and ŷ[n] are appropriately referred to as data signals that represent the speech signal. The lower path in Figure 1.6 shows the decoder associated with the speech coder. The received data signal ŷ[n] is decoded using the inverse of the analysis processing, giving the sequence of samples x̂[n], which is then converted (using a D-to-A converter) back to an analog signal x̂(t) for human listening. The decoder is often called a synthesizer because it must reconstitute the speech waveform from data that often bear no direct relationship to the original waveform samples.
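At its very simplest, the encode/decode path of Figure 1.6 can be illustrated by μ-law companding of the waveform samples themselves, the logarithmic quantization behind 64 kbps log-PCM telephony. The sketch below is ours, not a coder from this book, and uses the continuous μ-law formula rather than the exact G.711 segment tables:

```python
import numpy as np

MU = 255.0  # companding parameter used in North American telephony

def mulaw_encode(x):
    """Compress samples in [-1, 1] into 8-bit codewords (the y[n] of Figure 1.6)."""
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    return np.round((y + 1.0) / 2.0 * 255.0).astype(np.uint8)

def mulaw_decode(code):
    """Expand 8-bit codewords back into samples (the x_hat[n] of Figure 1.6)."""
    y = code.astype(np.float64) / 255.0 * 2.0 - 1.0
    return np.sign(y) * ((1.0 + MU) ** np.abs(y) - 1.0) / MU

# one second of a 440 Hz tone sampled at 8 kHz as a stand-in for speech
x = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
x_hat = mulaw_decode(mulaw_encode(x))
err = np.max(np.abs(x - x_hat))  # bounded by the largest companding step (roughly 0.02 here)
```

Eight bits per sample at an 8 kHz sampling rate gives the 64 kbps rate of conventional digital telephony, half that of 16-bit linear PCM; the coders discussed in Chapter 11 go far below this by exploiting speech production and perception models rather than sample-by-sample quantization.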
With carefully designed error protection coding of the digital representation, the transmitted (y[n]) and received (ŷ[n]) data can be essentially identical. This is the quintessential feature of digital coding. In theory, perfect transmission of the coded digital representation is possible even under very noisy channel conditions, and in the case of digital storage, it is possible to store a perfect copy of the digital representation in perpetuity if sufficient care is taken to update the storage medium as storage technology advances. This means that the speech signal can be reconstructed to within the accuracy of the original coding for as long as the digital representation is retained. In either case, the goal of the speech coder is to start with samples of the speech signal and reduce (compress) the data rate required to represent the speech signal while maintaining a desired perceptual fidelity. The compressed representation can be more efficiently transmitted or stored, or the bits saved can be devoted to error protection.
Speech coders enable a broad range of applications including narrowband and broadband wired telephony, cellular communications, voice over Internet protocol (VoIP) (which utilizes the Internet as a real-time communications medium), secure voice for privacy and encryption (for national security applications), extremely narrowband communications channels [such as battlefield applications using high frequency (HF) radio], and storage of speech for telephone answering machines, interactive voice response (IVR) systems, and pre-recorded messages. Speech coders often employ many aspects of both the speech production and speech perception processes, and hence may not be useful for more general audio signals such as music. Coders that incorporate only aspects of sound perception generally do not achieve as much compression as those based on speech production, but they are more general and can be used for all types of audio signals. These coders are widely deployed in MP3 and AAC players and for audio in digital television systems [374].
1.3.2 Text-to-Speech Synthesis
For many years, scientists and engineers have studied the speech production process with the goal of building a system that can start with text and produce speech automatically.
In a sense, a text-to-speech synthesizer, such as the one depicted in Figure 1.7, is a digital simulation of the entire upper part of the speech chain diagram. The input to the system is ordinary text such as an email message or an article from a newspaper or magazine. The first block in the text-to-speech synthesis system, labeled linguistic rules, has the job of converting the printed text input into a set of sounds that the machine can synthesize. The conversion from text to sounds involves a set of linguistic rules that must determine the appropriate set of sounds (perhaps including things like emphasis, pauses, rates of speaking, etc.) so that the resulting synthetic speech will express the words and intent of the text message in what passes for a natural voice that can be decoded accurately by human speech perception. This is more difficult than simply looking up the words in a pronouncing dictionary because the linguistic rules must determine how to pronounce acronyms, how to pronounce ambiguous words like read, bass, object, how to pronounce abbreviations like St. (street or saint), Dr. (doctor or drive), and how to properly pronounce proper names, specialized terms, etc. Once the proper pronunciation of the text has been determined, the role of the synthesis algorithm is to create the appropriate sound sequence to represent the text message in the form of speech. In essence, the synthesis algorithm must simulate the action of the vocal tract system in creating the sounds of speech. There are many procedures for assembling the speech sounds and compiling them into a proper sentence, but the most promising one today is called "unit selection and concatenation." In this method the computer stores multiple versions of each of the basic linguistic units of speech (phones, half phones, syllables, etc.), and then decides which sequence of speech units sounds best for the particular text message that is being produced.
The basic digital representation is not generally the sampled speech wave. Instead, some sort of compressed representation is normally used to save memory and, more importantly, to allow convenient manipulation of durations and blending of adjacent sounds. Thus, the speech synthesis algorithm would include an appropriate decoder, as discussed in Section 1.3.1, whose output is converted to an analog representation via the D-to-A converter.
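As a rough sketch of the "blending of adjacent sounds" step, the toy code below joins two stored units with a linear crossfade at the junction. This is our own simplification: a real unit-selection synthesizer works on compressed parametric representations, not raw waveforms, and blends with far more care.

```python
import numpy as np

def concatenate_units(units, overlap=80):
    """Join stored speech units with a linear crossfade over `overlap` samples
    (10 ms at 8 kHz) -- a crude stand-in for the blending a unit-selection
    synthesizer performs at each join."""
    out = units[0].astype(np.float64)
    fade = np.linspace(0.0, 1.0, overlap)
    for unit in units[1:]:
        unit = unit.astype(np.float64)
        # crossfade the tail of the assembled signal into the head of the next unit
        out[-overlap:] = out[-overlap:] * fade[::-1] + unit[:overlap] * fade
        out = np.concatenate([out, unit[overlap:]])
    return out

# two invented "units": 200 Hz and 300 Hz segments at an 8 kHz sampling rate
t = np.arange(1600) / 8000.0
units = [np.sin(2 * np.pi * 200 * t), np.sin(2 * np.pi * 300 * t)]
speech = concatenate_units(units)
```

Because the crossfade weights sum to one at every sample, the join introduces no amplitude overshoot; choosing *which* units to join, the harder half of unit selection, is not addressed here.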
Text-to-speech synthesis systems are an essential component of modern human-machine communication systems and are used to do things like read email messages over a telephone, provide voice output from a GPS (global positioning system) in automobiles, provide the voices for talking agents for completion of transactions over the Internet, handle call center help desks and customer care applications, serve as the voice for providing information from handheld devices such as foreign language phrasebooks, dictionaries, and crossword puzzle helpers, and serve as the voice of announcement machines that provide information such as stock quotes, airline schedules, and updates on arrivals and departures of flights. Another important application is in reading machines for the blind, where an optical character recognition system provides the text input to a speech synthesis system.
1.3.3 Speech Recognition and Other Pattern Matching Problems
Another large class of digital speech processing applications is concerned with the automatic extraction of information from the speech signal.
Most such systems involve some sort of pattern matching. Figure 1.8 shows a block diagram of a generic approach to pattern matching problems in speech processing. Such problems include the following: speech recognition, where the object is to extract the message from the speech signal; speaker recognition, where the goal is to identify who is speaking; speaker verification, where the goal is to verify a speaker's claimed identity from analysis of their speech signal; word spotting, which involves monitoring a speech signal for the occurrence of specified words or phrases; and automatic indexing of speech recordings based on recognition (or spotting) of spoken keywords.
The first block in the pattern matching system converts the analog speech waveform to digital form using an A-to-D converter. The feature analysis module converts the sampled speech signal to a set of feature vectors. Often, the same analysis techniques that are used in speech coding are also used to derive the feature vectors. The final block in the system, namely the pattern matching block, dynamically time-aligns the set of feature vectors representing the speech signal with a concatenated set of stored patterns, and chooses the identity associated with the pattern that is the closest match to the time-aligned set of feature vectors of the speech signal. The symbolic output consists of a set of recognized words, in the case of speech recognition; the identity of the best matching talker, in the case of speaker recognition; or a decision as to whether to accept or reject the identity claim of a speaker, in the case of speaker verification.
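The dynamic time alignment just described is classically implemented with dynamic time warping (DTW). The sketch below assumes each stored pattern is a whole-word template of feature vectors; the toy two-dimensional "features" and the word templates are invented purely for illustration.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping between feature-vector sequences a (m x d) and
    b (n x d): the cost of the best monotonic time alignment."""
    m, n = len(a), len(b)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])  # local distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[m, n]

def recognize(features, templates):
    """Pick the stored pattern whose time-aligned distance is smallest."""
    return min(templates, key=lambda name: dtw_distance(features, templates[name]))

# invented templates: each "word" is a short sequence of 2-D feature vectors
templates = {
    "yes": np.array([[0.0, 1.0], [0.0, 2.0], [0.0, 3.0]]),
    "no":  np.array([[5.0, 0.0], [6.0, 0.0]]),
}
# an utterance of "yes" spoken more slowly (time-warped relative to the template)
utterance = np.array([[0.0, 1.0], [0.0, 1.0], [0.0, 2.0], [0.0, 3.0], [0.0, 3.0]])
print(recognize(utterance, templates))  # prints "yes"
```

The point of the warping is visible in the example: the utterance has five frames against the template's three, yet the alignment cost is still zero because DTW lets one template frame absorb several utterance frames.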
Although the block diagram of Figure 1.8 represents a wide range of speech pattern matching problems, the biggest use has been in the area of recognition and understanding of speech in support of human-machine communication by voice. The major areas where such a system finds applications include command and control of computer software, voice dictation to create letters, memos, and other documents, natural language voice dialogues with machines to enable help desks and call centers, and agent services such as calendar entry and update, address list modification and entry, etc.
Pattern recognition applications often occur in conjunction with other digital speech processing applications. For example, one of the pre-eminent uses of speech technology is in portable communication devices. Speech coding at bit rates on the order of 8 kbps enables normal voice conversations in cell phones. Spoken name speech recognition in cell phones enables voice dialing capability, which can automatically dial the number associated with the recognized name. Names from directories with upwards of several hundred names can readily be recognized and dialed using simple speech recognition technology.
Another major speech application that has long been a dream of speech researchers is automatic language translation. The goal of language translation systems is to convert spoken words in one language to spoken words in another language so as to facilitate natural language voice dialogues between people speaking different languages. Language translation technology requires speech synthesis systems that work in both languages, along with speech recognition (and generally natural language understanding) that also works for both languages; hence it is a very difficult task and one for which only limited progress has been made. When such systems exist, it will be possible for people speaking different languages to communicate at data rates on the order of that of printed text reading!
1.3.4 Other Speech Applications
The range of speech communication applications is illustrated in Figure 1.9. As seen in this figure, the techniques of digital speech processing are a key ingredient of a wide range of applications that include the three areas of transmission/storage, speech synthesis, and speech recognition as well as many others such as speaker identification, speech signal quality enhancement, and aids for the hearing or visually impaired.
The block diagram in Figure 1.10 represents any system where time signals such as speech are processed by the techniques of DSP. This figure simply depicts the notion that once the speech signal is sampled, it can be manipulated in virtually limitless ways by DSP techniques. Here again, manipulations and modifications of the speech signal are usually achieved by transforming the speech signal into an alternative representation (that is motivated by our understanding of speech production and speech perception), operating on that representation by further digital computation, and then transforming back to the waveform domain, using a D-to-A converter.
One important application area is speech enhancement, where the goal is to remove or suppress noise or echo or reverberation picked up by a microphone along with the desired speech signal. In human-to-human communication, the goal of speech enhancement systems is to make the speech more intelligible and more natural; however, in reality the best that has been achieved so far is less perceptually annoying speech that essentially maintains, but does not improve, the intelligibility of the degraded speech. Success has been achieved, however, in making distorted speech signals more useful for further processing as part of a speech coder, synthesizer, or recognizer [212].
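One classical enhancement technique of the kind alluded to above is spectral subtraction: estimate the noise magnitude spectrum (in practice, from frames where no speech is present) and subtract it from each noisy frame while keeping the noisy phase. The single-frame sketch below assumes an idealized noise estimate and is illustrative only, not a production denoiser:

```python
import numpy as np

def spectral_subtract(noisy, noise_mag, floor=0.02):
    """Enhance one frame by subtracting an estimated noise magnitude spectrum,
    reusing the noisy phase (magnitude-domain spectral subtraction)."""
    spectrum = np.fft.rfft(noisy)
    mag = np.abs(spectrum)
    phase = np.angle(spectrum)
    # the spectral floor keeps bins from going to zero, which would
    # otherwise produce audible "musical noise" artifacts
    clean_mag = np.maximum(mag - noise_mag, floor * mag)
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=len(noisy))

rng = np.random.default_rng(0)
n = np.arange(512)
speech = np.sin(2 * np.pi * 0.05 * n)      # stand-in for one voiced frame
noise = 0.3 * rng.standard_normal(512)
noise_mag = np.abs(np.fft.rfft(noise))     # idealized noise estimate
enhanced = spectral_subtract(speech + noise, noise_mag)
```

Real systems apply this frame by frame with overlap, smooth the noise estimate over time, and tune the floor carefully; the trade-off between residual noise and musical-noise artifacts is exactly the "less perceptually annoying, not more intelligible" outcome described above.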
Other examples of manipulation of the speech signal include time-scale modification to align voices with video segments, to modify voice qualities, and to speed up or slow down pre-recorded speech (e.g., for talking books, rapid review of voice mail messages, or careful scrutinizing of spoken material). Such modifications of the signal are often more easily done on one of the basic digital representations rather than on the sampled waveform itself.
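The speed-up/slow-down operation can be sketched as overlap-add (OLA) time-scale modification: analysis frames are read from the input at one hop and overlap-added to the output at another. The naive version below ignores pitch-synchrony, which practical systems (SOLA/PSOLA-style refinements) add to avoid phase artifacts:

```python
import numpy as np

def time_stretch_ola(x, rate, frame=256, hop_out=64):
    """Naive overlap-add time stretching: read analysis frames every
    rate*hop_out input samples, write them every hop_out output samples.
    rate > 1 speeds the speech up; rate < 1 slows it down."""
    hop_in = int(round(rate * hop_out))
    win = np.hanning(frame)
    n_frames = max(1, (len(x) - frame) // hop_in + 1)
    out = np.zeros(hop_out * (n_frames - 1) + frame)
    norm = np.zeros_like(out)
    for k in range(n_frames):
        seg = x[k * hop_in : k * hop_in + frame] * win
        out[k * hop_out : k * hop_out + frame] += seg
        norm[k * hop_out : k * hop_out + frame] += win
    return out / np.maximum(norm, 1e-8)  # undo the summed window gain

x = np.sin(2 * np.pi * 0.01 * np.arange(4000))  # stand-in for recorded speech
slow = time_stretch_ola(x, rate=0.5)   # roughly twice as long
fast = time_stretch_ola(x, rate=2.0)   # roughly half as long
```

Note how this illustrates the chapter's larger point: the operation is trivial on a frame-based representation but awkward to express directly on the waveform samples.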
COMMENT ON THE REFERENCES
The bibliography at the end of this book contains all the references for all the chapters. Many of these references are research papers that established important results in the field of digital speech processing. Also included in the bibliography are a number of important and valuable reference texts that are often cited as well. Some of these are "classic" texts that hold a special place in the evolution of the field. Others are more recent, and thus they provide knowledge about the latest developments in the field. The following, listed in chronological order by publication date and in categories suggested in this chapter, are texts that we have consulted in our teaching and research. These will be of special interest with regard to application areas where our coverage of topics is less detailed.
• 1.3 APPLICATIONS OF DIGITAL SPEECH PROCESSING
1.3数字语音处理的应用
• The first step in most applications of digital speech processing is to convert the acoustic waveform to a sequence of numbers.
在大多数数字语音处理的应用中,第一步是将声波波形转换成数字序列。
• This discrete-time representation is the starting point for most applications.
这种离散时间表示是大多数应用程序的起点。
• From this point, more powerful representations are obtained by digital processing.
从这一点出发,通过数字处理可以得到更强大的表示。
• For the most part, these alternative representations are based on combining DSP operations with knowledge about the workings of the speech chain as depicted in Figure 1.4.
在大多数情况下,这些替代表示是基于将DSP操作与图1.4所示的语音链工作原理的知识相结合。
• As we will see, it is possible to incorporate aspects of both the speech production and speech perception processes into the digital representation and processing.
正如我们将看到的,可以将语音产生和语音感知过程的各个方面合并到数字表示和处理中。
• As our discussion unfolds, it will become clear that it is not an oversimplification to assert that digital speech processing is grounded in a set of techniques that have the goal of pushing the data rate of the speech representation to the left (i.e., lowering the data rate) along either the upper or lower path in Figure 1.4.
作为我们的讨论的展开,它将变得明显,这不是一个简单地断言,数字语音处理是基于一组技术,推动演讲的数据率的目标表示左(即降低数据率)上或较低的路径如图1.4所示。
•
• 1.3.1 The remainder of this chapter is devoted to a brief summary of the applications of digital speech processing;
1.3.1本章其余部分将对数字语音处理的应用进行简要总结;
• i.e., the systems that people interact with daily.
也就是说,人们每天与之互动的系统。
• Our discussion will emphasize the importance of the chosen digital representation in all application areas
我们的讨论将强调所选数字表示在所有应用领域的重要性
• Speech Coding Perhaps
语音编码也许
• the most widespread applications of digital speech processing technology occur in the areas of digital transmission and storage of speech signals.
数字语音处理技术最广泛的应用是在语音信号的数字传输和存储领域。
• In these areas the centrality of the digital representation is obvious, since the goal is to compress the digital waveform representation of speech into a lower bit-rate representation.
在这些领域中,数字表示的中心性是明显的,因为目标是将语音的数字波形表示压缩为较低的比特率表示。
• It is common to refer to this activity as "speech coding" or "speech compression."
通常将此活动称为“语音编码”或“语音压缩”。
• Our discussion of the information content of speech in Section 1.1 suggests that there is much room for compression.
我们在1.1节中对语音的信息内容进行了讨论,认为有很大的压缩空间。
• In general, this compression is achieved by combining DSP techniques with fundamental knowledge of the speech production and perception processes.
一般来说,这种压缩是通过将DSP技术与语音产生和感知过程的基本知识相结合来实现的。
•
• Figure 1.6 shows a block diagram of a generic speech encoding/decoding (or compression) system.
图1.6显示了一个通用语音编码/解码(或压缩)系统的框图。
• In the upper part of the figure, the A-to-D converter converts the analog speech signal x(t) to a sampled waveform representation x[n].
在图的上部,a -to- d转换器将模拟语音信号x(t)转换为采样波形表示x[n]。
• The digital signal x[n] is analyzed and coded by digital computation algorithms to produce a new digital signal y[n] that can be transmitted over a digital communication channel or stored in a digital storage medium as y[n].
数字信号x[n]通过数字计算算法进行分析和编码,产生新的数字信号y[n],可以通过数字通信通道传输,也可以存储在数字存储介质y[n]中。
• As we will see, particularly in Chapter 11 for speech and Chapter 12 for audio signals, there are a myriad of ways to do the encoding so as to reduce the data rate over that of the sampled and quantized waveform xInl.
正如我们将看到的,特别是在第11章的语音和第12章的音频信号中,有无数的方法来进行编码,以降低数据率的采样和量化波形xInl。
• Because the digital representation at this point is often not directly related to the sampled speech waveform, y[n] and y[n] are appropriately referred to as data signals that represent the speech signal.
由于此时的数字表示通常与采样语音波形没有直接关系,y[n]和y[n]被恰当地称为表示语音信号的数据信号。
• The lower path in Figure 1.6 shows the decoder associated with the speech coder.
图1.6中较低的路径显示了与语音编码器相关联的解码器。
• The received data signal ý{n] is decoded using the inverse of the analysis processing, giving the sequence of samples {[n], which is then converted (using a D-to-A Converter) back to an analog signal (t) for human listening.
接收到的数据信号ý{n]被解码使用逆向分析处理,给出样本序列{[n],然后被转换(使用D-to-A转换器)回模拟信号(t)供人类收听。
• The decoder is often called a synthesizer because it must reconstitute the speech waveform from data that often bear no direct relationship to the original waveform samples.
解码器通常被称为合成器,因为它必须从通常与原始波形样本没有直接关系的数据中重构语音波形。
• With carefully designed error protection coding of the digital representation, the transmitted (y[n]) and received (n) data can be essentially identical.
通过仔细设计数字表示的错误保护编码,传输(y[n])和接收(n)数据可以本质上相同。
• This is the quintessential feature of digital coding.
这是数字编码的典型特征。
• In theory, perfect transmission of the coded digital representation is possible even under very noisy channel conditions, and in the case of digital storage, it is possible to store a perfect copy of the digital representation in perpetuity if sufficient care is taken to update the storage medium as storage technology advances.
理论上,完美的传输编码的数字表示有可能即使非常噪声信道条件下,数字存储的情况下,可以永久存储的一个完美的副本数字表示如果足够的护理来更新存储介质存储技术的进步。
• This means that the speech signal can be reconstructed to within the accuracy of the original coding for as long as the digital representation is retained In either case, the goal of the speech coder is to start with samples of the speech signal and reduce (compress) the data rate required to represent the speech signal while maintaining a desired perceptual fidelity.
这意味着可以重建语音信号在原始编码的准确性,只要数字表示保留在任何一种情况下,语音编码器的目标开始语音信号样本,减少(压缩)代表语音信号所需的数据率,同时保持所需的知觉忠诚。
• The compressed representation can be more efficiently transmitted or stored, or the bits saved can be devoted to error protection.
压缩的表示可以更有效地传输或存储,或者保存的比特可以专用于错误保护。
•
• Speech coders enable a broad range of applications including narrowband and broadband wired telephony, cellular communications, voice over Internet protocol (VolP) (which utilizes the Internet as a real-time communications medium) secure voice for privacy and encryption (for national security applications), extremely narrowband communications channels [such as battlefield applications using high frequency (HF) radio],
演讲程序员使范围广泛的应用程序包括窄带和宽带有线电话、移动通信、互联网协议语音(VolP)(利用互联网作为一个实时通信介质)的隐私安全的声音和加密(国家安全应用程序),极窄带通信渠道(如战场使用高频(HF)广播的应用程序),
• and for storage of speech for telephone answering machines interactive voice response (IVR) systems, and pre-recorded messages.
以及用于存储电话答录机的语音、交互式语音应答(IVR)系统和预先录制的信息。
• Speech coders often employ many aspects of both the speech production and speech perception processes, and hence may not be useful for more general audio signals such as music.
语音编码器经常使用语音产生和语音感知过程的许多方面,因此可能不适用于更一般的音频信号,如音乐。
• Coders that are based on incorporating only aspects of sound perception generally do not achieve as much compression as those based on speech production, but they are more general and can be used for all types of audio signals.
仅基于声音感知方面的编码器通常不能实现与基于语音生成的编码器一样多的压缩,但它们更通用,可以用于所有类型的音频信号。
• These coders are widely deployed in MP3 and AAC players and for audio in digital television systems [374].
这些编码器广泛应用于MP3和AAC播放器以及数字电视系统的音频[374]。
•
• 1.3.2 Text-to-Speech Synthesis
1.3.2语音合成
• For many years, scientists and engineers have studied the speech production process with the goal of building a system that can start with text and produce speech automatically.
多年来,科学家和工程师们一直在研究语音产生过程,他们的目标是建立一个可以从文本开始并自动产生语音的系统。
• In a sense, a text-to-speech synthesizer, such as the one depicted in Figure 1.7.
从某种意义上说,一种文本-语音合成器,如图1.7所示。
• is a digital simulation of the entire upper part of the speech chain diagram.
是一个数字仿真的整个上半部分语音链图。
• The input to the system is ordinary text such as an email message or an article from a newspaper or magazine.
该系统的输入是普通文本,如电子邮件信息或报纸或杂志上的一篇文章。
• The first block in the text-to-speech synthesis system, labeled linguistid rules, has the job of converting the printed text input into a set of sounds that the machine can synthesize.
文本-语音合成系统的第一个区块被标记为“语言学规则”(linguistid rules),它的任务是将打印出来的文本输入转换成机器可以合成的一组声音。
• The conversion from text to sounds involves a set of linguistic rules that must determine the appropriate set of sounds (perhaps including things like emphasis, pauses, rates of speaking, etc.) so that the resulting synthetic speech will express the words and intent of the text message in what passes for a natural voice that can be decoded accurately by human speech perception.
从文本转换声音包括一组语言规则,必须确定适当的声音(或许包括强调,停顿了一下,说,等),使产生的合成语音表达单词和文本消息的意图通过自然的声音可以通过人类语言解码准确感知。
• This is more difficult than simply looking up the words in a pronouncing dictionary because the linguistic rules must determine how to pronounce acronyms, how to pronounce ambiguous words like read, bass, object, how to pronounce abbreviations like St. (street or saint), Dr. (doctor or drive), and how to properly pronounce proper names, specialized terms, etc.
这是更加困难比简单地查找单词在发音字典,因为语言缩写规则必须决定如何发音,如何发音含糊不清的词读,鲈鱼,对象,如何发音缩写像圣(街道或圣人),博士(医生或开车),以及如何正确地发音正确的姓名、专业术语等。
• Once the proper pronunciation of the text has been determined, the role of the synthesis algorithm is to create the appropriate sound sequence to represent the text message in the form of speech.
一旦确定了文本的正确发音,合成算法的作用是创建合适的语音序列,以语音的形式来表示文本信息。
• In essence, the synthesis algorithm must simulate the action of the vocal tract system in creating the sounds of speech.
实质上,合成算法必须模拟声道系统在产生语音时的动作。
• There are many procedures for assembling the speech sounds and compiling them into a proper sentence, but the most promising one today is called "unit selection and concatenation."
把语音组合成一个合适的句子有很多方法,但目前最有希望的方法是“单元选择和连接”。
• In this method the computer stores multiple versions of each of the basic linguistic units of speech (phones, half phones, syllables, etc.), and then decides which sequence of speech units sounds best for the particular text message that is being produced.
在这种方法中,计算机存储每个基本语音单元(电话、半电话、音节等)的多个版本,然后决定哪个序列的语音单元听起来最适合正在生成的特定文本信息。
• The basic digital representation is not generally the sampled speech wave.
基本的数字表示通常不是采样的语音波。
• Instead, some sort of compressed representation is normally used to save memory and, more importantly, to allow convenient manipulation of durations and blending of adjacent sounds.
相反,通常使用某种压缩表示来节省内存,更重要的是,允许方便地操作持续时间和混合相邻的声音。
• Thus, the speech synthesis algorithm would include an appropriate decoder, as discussed in Section 1.3.1, whose output is converted to an analog representation via the D-to-A converter.
因此,语音合成算法将包括一个适当的解码器,如第1.3.1节所述,其输出通过D-to-A转换器转换为模拟表示。
• Text-to-speech synthesis systems are an essential component of modern human machine communications systems and are used to do things like read email messages over a telephone, provide voice output from a GPS (global positioning system) in automobiles, provide the voices for talking agents for completion of transactions over the Internet, handle call center help desks and customer care applications,
语音合成系统是现代人机通信系统必不可少的组成部分,用于做事情喜欢读电子邮件/电话,提供语音输出汽车GPS(全球定位系统),提供代理说话的声音在网上完成交易,处理呼叫中心帮助台和客户服务的应用程序,
• serve as the voice for providing information from handheld devices such as foreign language phrasebooks, dictionaries, crossword puzzle helpers, and as the voice of announcement machines that provide information such as stock quotes, airline schedules, updates on arrivals and departures of flights, etc.
作为语音,提供来自手持设备的信息,如外语短语书、字典、纵横字谜助手,以及作为语音报幕机,提供信息,如股票报价、航班时刻表、航班到达和起飞的最新情况等。
• Another important application is in reading machines for the blind.
另一个重要的应用是盲人阅读机。
• where an optical character recognition system provides the text input to a speech synthesis system.
其中,光学字符识别系统为语音合成系统提供文本输入。
•
• 1.3.3 Speech Recognition and Other Pattern Matching Problems
1.3.3语音识别等模式匹配问题
Another large class of digital speech processing applications is concerned with the automatic extraction of information from the speech signal. Most such systems involve some sort of pattern matching. Figure 1.8 shows a block diagram of a generic approach to pattern matching problems in speech processing. Such problems include the following: speech recognition, where the object is to extract the message from the speech signal; speaker recognition, where the goal is to identify who is speaking; speaker verification, where the goal is to verify a speaker's claimed identity from analysis of their speech signal; word spotting, which involves monitoring a speech signal for the occurrence of specified words or phrases; and automatic indexing of speech recordings based on recognition (or spotting) of spoken keywords.

The first block in the pattern matching system converts the analog speech waveform to digital form using an A-to-D converter. The feature analysis module converts the sampled speech signal to a set of feature vectors. Often, the same analysis techniques that are used in speech coding are also used to derive the feature vectors.
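As a concrete sketch of such a feature analysis front end (a minimal illustration, not the specific analysis developed in this book — the function name and parameter values are hypothetical), the following converts a sampled signal into one small feature vector per overlapping frame, here just short-time log energy and a zero-crossing count:

```python
import numpy as np

def frame_features(x, frame_len=400, hop=160):
    """Convert a sampled speech signal into a sequence of feature vectors.

    Each analysis frame yields a small vector: short-time log energy and a
    zero-crossing count. Real recognizers use richer features (e.g., cepstral
    coefficients), but the framing structure is the same.
    """
    feats = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len] * np.hamming(frame_len)
        log_energy = np.log(np.sum(frame ** 2) + 1e-10)   # avoid log(0)
        zero_crossings = np.sum(np.abs(np.diff(np.sign(frame))) > 0)
        feats.append([log_energy, zero_crossings])
    return np.array(feats)

# 1 second of a 100 Hz tone at a 16 kHz sampling rate:
# 25 ms frames every 10 ms yield one feature vector per hop.
x = np.sin(2 * np.pi * 100 * np.arange(16000) / 16000)
F = frame_features(x)
```

The resulting matrix `F` (one row per frame) is the kind of feature-vector sequence the pattern matching block consumes.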
The final block in the system, namely the pattern matching block, dynamically time aligns the set of feature vectors representing the speech signal with a concatenated set of stored patterns, and chooses the identity associated with the pattern that is the closest match to the time-aligned set of feature vectors of the speech signal. The symbolic output consists of a set of recognized words, in the case of speech recognition; the identity of the best matching talker, in the case of speaker recognition; or a decision as to whether to accept or reject the identity claim of a speaker, in the case of speaker verification.
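The dynamic time alignment just described is classically implemented with dynamic time warping (DTW). The following is a minimal sketch, with hypothetical names, assuming feature-vector sequences are rows of NumPy arrays and Euclidean distance as the local frame cost:

```python
import numpy as np

def dtw_distance(A, B):
    """Dynamic time warping distance between two feature-vector sequences.

    A and B have shape (num_frames, num_features). The recursion permits the
    classic match/insertion/deletion steps, so sequences of different lengths
    (different speaking rates) can still be aligned.
    """
    n, m = len(A), len(B)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(A[i - 1] - B[j - 1])  # local frame distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def recognize(features, reference_patterns):
    """Return the label of the stored pattern closest to the input under DTW."""
    return min(reference_patterns,
               key=lambda label: dtw_distance(features, reference_patterns[label]))

# A stretched "rising" contour still matches the "up" reference best.
refs = {"up": np.array([[0.0], [1.0], [2.0]]),
        "down": np.array([[2.0], [1.0], [0.0]])}
query = np.array([[0.0], [0.0], [1.0], [2.0]])
label = recognize(query, refs)
```

Modern recognizers replace template matching with statistical models, but the align-then-choose structure of Figure 1.8 is the same.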
Although the block diagram of Figure 1.8 represents a wide range of speech pattern matching problems, the biggest use has been in the area of recognition and understanding of speech in support of human-machine communication by voice. The major areas where such a system finds applications include command and control of computer software; voice dictation to create letters, memos, and other documents; natural language voice dialogues with machines to enable help desks and call centers; and agent services such as calendar entry and update, address list modification and entry, etc.
Pattern recognition applications often occur in conjunction with other digital speech processing applications. For example, one of the pre-eminent uses of speech technology is in portable communication devices. Speech coding at bit rates on the order of 8 kbps enables normal voice conversations in cell phones. Spoken name speech recognition in cell phones enables voice dialing capability, which can automatically dial the number associated with the recognized name. Names from directories with upwards of several hundred names can readily be recognized and dialed using simple speech recognition technology.
Another major speech application that has long been a dream of speech researchers is automatic language translation. The goal of language translation systems is to convert spoken words in one language to spoken words in another language so as to facilitate natural language voice dialogues between people speaking different languages. Language translation technology requires speech synthesis systems that work in both languages, along with speech recognition (and generally natural language understanding) that also works for both languages; hence it is a very difficult task and one for which only limited progress has been made. When such systems exist, it will be possible for people speaking different languages to communicate at data rates on the order of that of printed text reading!
1.3.4 Other Speech Applications
The range of speech communication applications is illustrated in Figure 1.9. As seen in this figure, the techniques of digital speech processing are a key ingredient of a wide range of applications that include the three areas of transmission/storage, speech synthesis, and speech recognition, as well as many others such as speaker identification, speech signal quality enhancement, and aids for the hearing or visually impaired.
The block diagram in Figure 1.10 represents any system where time signals such as speech are processed by the techniques of DSP. This figure simply depicts the notion that once the speech signal is sampled, it can be manipulated in virtually limitless ways by DSP techniques. Here again, manipulations and modifications of the speech signal are usually achieved by transforming the speech signal into an alternative representation (that is motivated by our understanding of speech production and speech perception), operating on that representation by further digital computation, and then transforming back to the waveform domain, using a D-to-A converter.

One important application area is speech enhancement, where the goal is to remove or suppress noise or echo or reverberation picked up by a microphone along with the desired speech signal. In human-to-human communication, the goal of speech enhancement systems is to make the speech more intelligible and more natural; however, in reality the best that has been achieved so far is less perceptually annoying speech that essentially maintains, but does not improve, the intelligibility of the degraded speech. Success has been achieved, however, in making distorted speech signals more useful for further processing as part of a speech coder, synthesizer, or recognizer [212].
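As one hedged illustration of this transform-modify-invert pattern, the classic spectral subtraction method for noise suppression subtracts an estimated noise magnitude spectrum in the short-time Fourier domain. This is a sketch under simple assumptions (stationary noise, noisy phase retained, hypothetical parameter values), not a statement of any particular deployed system:

```python
import numpy as np

def spectral_subtraction(noisy, noise_mag, frame_len=256, hop=128, floor=0.01):
    """Suppress stationary noise by subtracting an estimated noise magnitude
    spectrum from each short-time spectrum, keeping the noisy phase.

    noise_mag: average magnitude spectrum of a noise-only segment
    (length frame_len // 2 + 1, matching np.fft.rfft). The spectral floor
    prevents negative magnitudes; "musical noise" is the well-known artifact
    of this simple scheme.
    """
    window = np.hanning(frame_len)
    out = np.zeros(len(noisy))
    for start in range(0, len(noisy) - frame_len + 1, hop):
        frame = noisy[start:start + frame_len] * window     # analyze
        spec = np.fft.rfft(frame)
        mag = np.abs(spec)
        cleaned = np.maximum(mag - noise_mag, floor * mag)  # modify
        out[start:start + frame_len] += np.fft.irfft(       # invert + overlap-add
            cleaned * np.exp(1j * np.angle(spec)))
    return out

# With a zero noise estimate the analysis/synthesis chain is nearly transparent.
t = np.arange(4096) / 8000.0
clean = np.sin(2 * np.pi * 440 * t)
out = spectral_subtraction(clean, noise_mag=np.zeros(129))
```

The point is structural: the signal is moved into a short-time spectral representation, modified there, and then transformed back to a waveform, exactly as Figure 1.10 suggests.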
Other examples of manipulation of the speech signal include time-scale modification to align voices with video segments, to modify voice qualities, and to speed-up or slow-down pre-recorded speech (e.g., for talking books, rapid review of voice mail messages, or careful scrutinizing of spoken material). Such modifications of the signal are often more easily done on one of the basic digital representations rather than on the sampled waveform itself.
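A minimal sketch of such a time-scale modification, assuming a plain overlap-add (OLA) scheme (practical systems use pitch-synchronized variants such as SOLA/WSOLA to avoid artifacts; all names and parameter choices here are illustrative):

```python
import numpy as np

def ola_time_scale(x, rate, frame_len=512, synthesis_hop=256):
    """Change the duration of a signal without resampling it.

    Analysis frames are taken every rate * synthesis_hop samples but laid
    down every synthesis_hop samples, so rate > 1 shortens (speeds up) the
    signal and rate < 1 lengthens (slows down) it. Plain OLA ignores
    phase/pitch continuity; WSOLA-style frame alignment would reduce the
    resulting artifacts.
    """
    analysis_hop = int(round(rate * synthesis_hop))
    window = np.hanning(frame_len)
    n_frames = (len(x) - frame_len) // analysis_hop + 1
    out = np.zeros((n_frames - 1) * synthesis_hop + frame_len)
    norm = np.zeros_like(out)
    for k in range(n_frames):
        frame = x[k * analysis_hop : k * analysis_hop + frame_len] * window
        out[k * synthesis_hop : k * synthesis_hop + frame_len] += frame
        norm[k * synthesis_hop : k * synthesis_hop + frame_len] += window
    return out / np.maximum(norm, 1e-8)  # compensate for window overlap

# Speed up a 1-second signal by 2x -> output of roughly half the length.
x = np.random.randn(16000)
y = ola_time_scale(x, rate=2.0)
```

Note that the modification operates on framed, windowed segments (a basic alternative representation) rather than sample-by-sample on the waveform, which is precisely why such manipulations are easier there.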
COMMENT ON THE REFERENCES
The bibliography at the end of this book contains all the references for all the chapters. Many of these references are research papers that established important results in the field of digital speech processing. Also included in the bibliography are a number of important and valuable reference texts that are often referenced as well. Some of these are "classic" texts that hold a special place in the evolution of the field. Others are more recent, and thus they provide knowledge about the latest developments in the field. The following, listed in chronological order by publication date and in categories suggested in this chapter, are texts that we have consulted in our teaching and research. These will be of special interest particularly with regard to application areas where our coverage of topics is less detailed.