Edition:
Author: Xu Tan
Series: Artificial Intelligence: Foundations, Theory, and Algorithms
ISBN: 9789819908264, 9789819908271
Publisher: Springer
Publication year: 2023
Number of pages: 214
Language: English
File format: PDF
File size: 9 MB
Text-to-speech (TTS) aims to synthesize intelligible and natural speech from a given text. It is a hot topic in language, speech, and machine learning research and has broad applications in industry. This book introduces neural network-based TTS in the era of deep learning, aiming to provide a good understanding of neural TTS, current research and applications, and future research trends. The book first introduces the history of TTS technologies, gives an overview of neural TTS, and provides preliminary knowledge on language and speech processing, neural networks and deep learning, and deep generative models. It then introduces neural TTS from the perspective of key components (text analysis, acoustic models, vocoders, and end-to-end models) and advanced topics (expressive and controllable, robust, model-efficient, and data-efficient TTS). It also points out some future research directions and collects resources related to TTS. This book is the first to introduce neural TTS in a comprehensive and easy-to-understand way, and can serve both academic researchers and industry practitioners working on TTS.
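The three-stage pipeline the abstract names (text analysis → acoustic model → vocoder) can be illustrated with a toy sketch. All components below are simplified stand-ins invented for illustration, not the models described in the book: a real system would produce phonemes, mel-spectrograms, and waveforms with trained neural networks.

```python
# Toy sketch of the neural TTS pipeline: text analysis -> acoustic model
# -> vocoder. Every component here is a hypothetical stand-in.
import math
import re


def text_analysis(text: str) -> list[str]:
    """Toy front end: normalize text and split it into character tokens
    (a real front end would emit phonemes and prosodic labels)."""
    normalized = re.sub(r"[^a-z ]", "", text.lower())
    return list(normalized)


def acoustic_model(tokens: list[str], n_mels: int = 4) -> list[list[float]]:
    """Toy acoustic model: map each token to one fake mel-spectrogram
    frame (a list of n_mels values in [0, 1))."""
    return [[(ord(t) % 17) / 17.0] * n_mels for t in tokens]


def vocoder(mel: list[list[float]], frame_len: int = 8) -> list[float]:
    """Toy vocoder: expand each mel frame into a short sinusoidal chunk
    whose amplitude follows the frame's mean energy."""
    wave: list[float] = []
    for frame in mel:
        amp = sum(frame) / len(frame)
        wave.extend(amp * math.sin(2 * math.pi * t / frame_len)
                    for t in range(frame_len))
    return wave


tokens = text_analysis("Hello, TTS!")   # 9 tokens: "hello tts"
mel = acoustic_model(tokens)            # 9 frames of 4 mel bins each
wave = vocoder(mel)                     # 9 * 8 = 72 waveform samples
print(len(tokens), len(mel), len(wave))
```

A fully end-to-end model (the book's Chapter 7) collapses these stages into a single network mapping characters or phonemes directly to the waveform.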
Foreword by Dong Yu
Foreword by Heiga Zen
Foreword by Haizhou Li
Preface
Acknowledgements
Contents
Acronyms
About the Author

1 Introduction
1.1 Motivation
1.2 History of TTS Technology
1.2.1 Articulatory Synthesis
1.2.2 Formant Synthesis
1.2.3 Concatenative Synthesis
1.2.4 Statistical Parametric Synthesis
1.3 Overview of Neural TTS
1.3.1 TTS in the Era of Deep Learning
1.3.2 Key Components of TTS
1.3.3 Advanced Topics in TTS
1.3.4 Other Taxonomies of TTS
1.3.5 Evolution of Neural TTS
1.4 Organization of This Book
References

Part I Preliminary

2 Basics of Spoken Language Processing
2.1 Overview of Linguistics
2.1.1 Phonetics and Phonology
2.1.2 Morphology and Syntax
2.1.3 Semantics and Pragmatics
2.2 Speech Chain
2.2.1 Speech Production and Articulatory Phonetics (Voiced vs Unvoiced and Vowels vs Consonants; Source-Filter Model)
2.2.2 Speech Transmission and Acoustic Phonetics
2.2.3 Speech Perception and Auditory Phonetics (How Humans Perceive Sound; Differences Between Auditory Perception and Physical Properties of Sound; Evaluation Metrics for Speech Perception)
2.3 Speech Signal Processing
2.3.1 Analog-to-Digital Conversion (Sampling; Quantization)
2.3.2 Time-to-Frequency-Domain Transformation (Discrete-Time Fourier Transform (DTFT); Discrete Fourier Transform (DFT); Fast Fourier Transform (FFT); Short-Time Fourier Transform (STFT))
2.3.3 Cepstral Analysis
2.3.4 Linear Predictive Coding/Analysis
2.3.5 Speech Parameter Estimation (Voiced/Unvoiced/Silent Speech Detection; F0 Detection; Formant Estimation)
2.3.6 Overview of Speech Processing Tasks
References

3 Basics of Deep Learning
3.1 Machine Learning Basics
3.1.1 Learning Paradigms (Supervised Learning; Unsupervised Learning; Reinforcement Learning; Semi-supervised Learning; Self-supervised Learning; Pre-training/Fine-Tuning; Transfer Learning)
3.1.2 Key Components of Machine Learning
3.2 Deep Learning Basics
3.2.1 Model Structures: DNN/CNN/RNN/Self-Attention (with a Comparison Between Different Structures)
3.2.2 Model Frameworks: Encoder/Decoder/Encoder-Decoder
3.3 Deep Generative Models
3.3.1 Autoregressive Models
3.3.2 Normalizing Flows
3.3.3 Variational Auto-Encoders
3.3.4 Denoising Diffusion Probabilistic Models
3.3.5 Score Matching with Langevin Dynamics, SDEs, and ODEs
3.3.6 Generative Adversarial Networks
3.3.7 Comparisons of Deep Generative Models
References

Part II Key Components in TTS

4 Text Analyses
4.1 Text Processing
4.1.1 Document Structure Detection
4.1.2 Text Normalization
4.1.3 Linguistic Analysis (Sentence Breaking and Type Detection; Word/Phrase Segmentation; Part-of-Speech Tagging; Homograph and Word Sense Disambiguation)
4.2 Phonetic Analysis
4.2.1 Polyphone Disambiguation
4.2.2 Grapheme-to-Phoneme Conversion
4.3 Prosodic Analysis
4.3.1 Pause, Stress, and Intonation
4.3.2 Pitch, Duration, and Loudness
4.4 Text Analysis from a Historic Perspective
4.4.1 Text Analysis in SPSS
4.4.2 Text Analysis in Neural TTS
References

5 Acoustic Models
5.1 Acoustic Models from a Historic Perspective
5.1.1 Acoustic Models in SPSS
5.1.2 Acoustic Models in Neural TTS
5.2 Acoustic Models with Different Structures
5.2.1 RNN-Based Models (e.g., Tacotron Series: Tacotron; Tacotron 2; Other Tacotron-Related Acoustic Models)
5.2.2 CNN-Based Models (e.g., DeepVoice Series)
5.2.3 Transformer-Based Models (e.g., FastSpeech Series: TransformerTTS; FastSpeech; FastSpeech 2)
5.2.4 Advanced Generative Models (GAN-Based, Flow-Based, VAE-Based, and Diffusion-Based Models)
References

6 Vocoders
6.1 Vocoders from a Historic Perspective
6.1.1 Vocoders in Signal Processing
6.1.2 Vocoders in Neural TTS
6.2 Vocoders with Different Generative Models
6.2.1 Autoregressive Vocoders (e.g., WaveNet)
6.2.2 Flow-Based Vocoders (e.g., Parallel WaveNet, WaveGlow)
6.2.3 GAN-Based Vocoders (e.g., MelGAN, HiFi-GAN)
6.2.4 Diffusion-Based Vocoders (e.g., WaveGrad, DiffWave)
6.2.5 Other Vocoders
References

7 Fully End-to-End TTS
7.1 Prerequisite Knowledge for Reading This Chapter
7.2 End-to-End TTS from a Historic Perspective
7.2.1 Stage 0: Character→Linguistic→Acoustic→Waveform
7.2.2 Stage 1: Character/Phoneme→Acoustic→Waveform
7.2.3 Stage 2: Character→Linguistic→Waveform
7.2.4 Stage 3: Character/Phoneme→Spectrogram→Waveform
7.2.5 Stage 4: Character/Phoneme→Waveform
7.3 Fully End-to-End Models
7.3.1 Two-Stage Training (e.g., Char2Wav, ClariNet)
7.3.2 One-Stage Training (e.g., FastSpeech 2s, EATS, VITS)
7.3.3 Human-Level Quality (e.g., NaturalSpeech)
References

Part III Advanced Topics in TTS

8 Expressive and Controllable TTS
8.1 Categorization of Variation Information in Speech
8.1.1 Text/Content Information
8.1.2 Speaker/Timbre Information
8.1.3 Style/Emotion Information
8.1.4 Recording Devices or Noise Environments
8.2 Modeling Variation Information for Expressive Synthesis
8.2.1 Explicit or Implicit Modeling
8.2.2 Modeling at Different Granularities
8.3 Modeling Variation Information for Controllable Synthesis
8.3.1 Disentangling for Control
8.3.2 Improving Controllability
8.3.3 Transferring with Control
References

9 Robust TTS
9.1 Improving Generalization Ability
9.2 Improving Text-Speech Alignment
9.2.1 Enhancing Attention
9.2.2 Replacing Attention with Duration Prediction
9.3 Improving Autoregressive Generation
9.3.1 Enhancing AR Generation
9.3.2 Replacing AR Generation with NAR Generation
References

10 Model-Efficient TTS
10.1 Parallel Generation
10.1.1 Non-Autoregressive Generation with CNN or Transformer
10.1.2 Non-Autoregressive Generation with GAN, VAE, or Flow
10.1.3 Iterative Generation with Diffusion
10.2 Lightweight Modeling
10.2.1 Model Compression
10.2.2 Neural Architecture Search
10.2.3 Other Technologies
10.3 Efficient Modeling with Domain Knowledge
10.3.1 Linear Prediction
10.3.2 Multiband Modeling
10.3.3 Subscale Prediction
10.3.4 Multi-Frame Prediction
10.3.5 Streaming or Chunk-Wise Synthesis
10.3.6 Other Technologies
References

11 Data-Efficient TTS
11.1 Language-Level Data-Efficient TTS
11.1.1 Self-Supervised Training
11.1.2 Cross-Lingual Transfer
11.1.3 Semi-Supervised Training
11.1.4 Mining Datasets in the Wild
11.1.5 Purely Unsupervised Learning
11.2 Speaker-Level Data-Efficient TTS
11.2.1 Improving Generalization
11.2.2 Cross-Domain Adaptation
11.2.3 Few-Data Adaptation
11.2.4 Few-Parameter Adaptation
11.2.5 Zero-Shot Adaptation
References

12 Beyond Text-to-Speech Synthesis
12.1 Singing Voice Synthesis
12.1.1 Challenges in Singing Voice Synthesis
12.1.2 Representative Models for Singing Voice Synthesis
12.2 Voice Conversion
12.2.1 Brief Overview of Voice Conversion
12.2.2 Representative Methods for Voice Conversion
12.3 Speech Enhancement/Separation
References

Part IV Summary and Outlook

13 Summary and Outlook
13.1 Summary
13.2 Future Directions
13.2.1 High-Quality Speech Synthesis
13.2.2 Efficient Speech Synthesis
References

A Resources of TTS
B TTS Model List
References