Edition:
Authors: Ling Zhenhua, Gao Jianqing, Yu Kai, Jia Jia
Series: Communications in Computer and Information Science, 1765
ISBN: 9819924006, 9789819924004
Publisher: Springer
Year: 2023
Pages: 341 [342]
Language: English
File format: PDF (can be converted to EPUB or AZW3 on request)
File size: 27 MB
This book constitutes the refereed proceedings of the 17th National Conference on Man-Machine Speech Communication, NCMMSC 2022, held in China in December 2022. The 21 full papers and 7 short papers included in this book were carefully reviewed and selected from 108 submissions. They are organized in topical sections covering, among others: MCPN: A Multiple Cross-Perception Network for Real-Time Emotion Recognition in Conversation; Baby Cry Recognition Based on Acoustic Segment Model; and MnTTS2: An Open-Source Multi-Speaker Mongolian Text-to-Speech Synthesis Dataset.
Preface
Organization
Contents

MCPN: A Multiple Cross-Perception Network for Real-Time Emotion Recognition in Conversation
    1 Introduction 2 Related Work 2.1 Emotion Recognition in Conversation 2.2 Dynamical Influence Model 3 Methodology 3.1 Problem Definition 3.2 Multimodal Utterance Feature Extraction 3.3 CPP: Context Pre-perception Module 3.4 MCP: Multiple Cross-Perception Module 3.5 Emotion Triple-Recognition Process 3.6 Loss Function 4 Experimental Settings 4.1 Datasets 4.2 Implementation Details 4.3 Baseline Methods 5 Results and Analysis 5.1 Overall Performance 5.2 Variants of Various Modalities 5.3 Effectiveness of State Interaction Interval 5.4 Performance on Similar Emotion Classification 5.5 Ablation Study 5.6 Error Study 6 Conclusion References

Baby Cry Recognition Based on Acoustic Segment Model
    1 Introduction 2 Method 2.1 Acoustic Segment Model 2.2 Latent Semantic Analysis 2.3 DNN Classifier 3 Experiments and Analysis 3.1 Database and Data Preprocessing 3.2 Ablation Experiments 3.3 Overall Comparison 3.4 Results Analysis 4 Conclusions References

A Multi-feature Sets Fusion Strategy with Similar Samples Removal for Snore Sound Classification
    1 Introduction 2 Materials and Methods 2.1 MPSSC Database 2.2 Feature Extraction 2.3 Classification Model 3 Experimental Setups 4 Results and Discussion 4.1 Classification Results 4.2 Limitations and Perspectives 5 Conclusion References

Multi-hypergraph Neural Networks for Emotion Recognition in Multi-party Conversations
    1 Introduction 2 Related Work 2.1 Emotion Recognition in Conversations 2.2 Hypergraph Neural Network 3 Methodology 3.1 Hypergraph Definition 3.2 Problem Definition 3.3 Model 3.4 Classifier 4 Experimental Setting 4.1 Datasets 4.2 Compared Methods 4.3 Implementation Details 5 Results and Discussions 5.1 Overall Performance 5.2 Ablation Study 5.3 Effect of Depths of GNN and Window Sizes 5.4 Error Analysis 6 Conclusion References

Using Emoji as an Emotion Modality in Text-Based Depression Detection
    1 Introduction 2 Emoji Extraction and Depression Detection 2.1 Emotion and Semantic Features 2.2 Depression Detection Model 3 Experiments 4 Results 4.1 Depression Detection on Social Media Text 4.2 Depression Detection on Dialogue Text 5 Analysis 6 Conclusion References

Source-Filter-Based Generative Adversarial Neural Vocoder for High Fidelity Speech Synthesis
    1 Introduction 2 Proposed Method 2.1 Overview 2.2 Source Module 2.3 Resolution-Wise Conditional Filter Module 3 Experiments 3.1 Experimental Setup 3.2 Comparison Among Neural Vocoders 3.3 Ablation Studies 4 Conclusion References

Semantic Enhancement Framework for Robust Speech Recognition
    1 Introduction 2 Related Work 2.1 Contextual Method 2.2 Adaptive Method 3 Method 3.1 Hybrid CTC/Attention Architecture 3.2 Pre-train Language Model 3.3 Semantic Enhancement Framework 3.4 Evaluation Metrics 4 Experiment 4.1 Dataset 4.2 Configuration 4.3 Impact of Losses 4.4 Results 5 Conclusions References

Achieving Timestamp Prediction While Recognizing with Non-autoregressive End-to-End ASR Model
    1 Introduction 2 Related Works 3 Preliminaries 3.1 Continuous Integrate-and-Fire 3.2 Paraformer 4 Methods 4.1 Scaled-CIF Training 4.2 Weights Post-processing 4.3 Evaluation Metrics 5 Experiments and Results 5.1 Datasets 5.2 Experiment Setup 5.3 Quality of Timestamp 5.4 ASR Results 6 Conclusion References

Predictive AutoEncoders Are Context-Aware Unsupervised Anomalous Sound Detectors
    1 Introduction 2 Related Work 2.1 Unsupervised Anomalous Sound Detection 2.2 Transformer 3 Proposed Method 3.1 Self-attention Mechanism 3.2 The Architecture of Predictive AutoEncoder 3.3 Training Strategy 4 Experiments and Results 4.1 Experimental Setup 4.2 Results 5 Conclusion References

A Pipelined Framework with Serialized Output Training for Overlapping Speech Recognition
    1 Introduction 2 Overview and the Proposed Methods 2.1 Pipelined Model 2.2 Serialized Output Training 2.3 Proposed Method 3 Experimental Settings 3.1 DataSet 3.2 Training and Evaluation Metric 3.3 Model Settings 4 Result Analysis 4.1 Baseline Results 4.2 Results of Proposed Method 5 Conclusion References

Adversarial Training Based on Meta-Learning in Unseen Domains for Speaker Verification
    1 Introduction 2 Overview of the Proposed Network 3 Method 3.1 Adversarial Training with Multi-task Learning 3.2 Improved Episode-Level Balanced Sampling 3.3 Domain-Invariant Attention Module 4 Experiments and Analysis 4.1 Experimental Settings 4.2 Comparative Experiments 5 Conclusion References

Multi-speaker Multi-style Speech Synthesis with Timbre and Style Disentanglement
    1 Introduction 2 The Proposed Model 2.1 The Network Structure of Proposed Network 2.2 Utterance Level Feature Normalization 3 Experimental Setup 4 Experimental Results 4.1 Subjective Evaluation 4.2 Ablation Study of Utterance Level Feature Normalization 4.3 Demonstration of the Proposed Model 4.4 Style Transition Illustration 5 Conclusions References

Multiple Confidence Gates for Joint Training of SE and ASR
    1 Introduction 2 Our Method 2.1 Multiple Confidence Gates Enhancement Module 2.2 Automatic Speech Recognition 2.3 Loss Function 3 Experiments 3.1 Dataset 3.2 Training Setup and Baseline 3.3 Experimental Results and Discussion 4 Conclusions References

Detecting Escalation Level from Speech with Transfer Learning and Acoustic-Linguistic Information Fusion
    1 Introduction 2 Related Works 2.1 Conflict Escalation Detection 2.2 Transfer Learning 2.3 Textual Embeddings 3 Datasets and Methods 3.1 Datasets 3.2 Methods 4 Experimental Results 4.1 Feature Configuration 4.2 Model Setup 4.3 Results 5 Discussion 6 Conclusions References

Pre-training Techniques for Improving Text-to-Speech Synthesis by Automatic Speech Recognition Based Data Enhancement
    1 Introduction 2 TTS Model 2.1 Text Analysis Module 2.2 Acoustic Model 2.3 Vocoder 3 The Proposed Approach 3.1 Frame-Wise Phoneme Classification 3.2 Semi-supervised Pre-training 3.3 AdaSpeech Fine-Tuning 4 Experiments 4.1 Single-Speaker Mandarin Task 4.2 Multi-speaker Chinese Dialects Task 5 Conclusions References

A Time-Frequency Attention Mechanism with Subsidiary Information for Effective Speech Emotion Recognition
    1 Introduction 2 Overview of the Speech Emotion Recognition Architecture 3 Methods 3.1 TC Self-attention Module 3.2 F Domain-Attention Module 4 Experiments 4.1 Dataset and Acoustic Features 4.2 System Description 5 Conclusion References

Interplay Between Prosody and Syntax-Semantics: Evidence from the Prosodic Features of Mandarin Tag Questions
    1 Introduction 2 Method 2.1 Participants 2.2 Stimuli 2.3 Procedure 2.4 Acoustic Analysis 3 Results 3.1 Fluctuation Scale 3.2 Duration Ratio 3.3 Intensity Ratio 4 Discussion References

Improving Fine-Grained Emotion Control and Transfer with Gated Emotion Representations in Speech Synthesis
    1 Introduction 2 Methodology 2.1 Fine-Grained Emotion Strengths from Ranking Function 2.2 The Proposed Method 3 Experiments 3.1 The Model Architecture of Baseline Emotional TTS 3.2 Tasks for Experiments and Models Setup 3.3 Basic Setups 3.4 Task-I: Evaluating the Proposed Method for Non-transferred Emotional Speech Synthesis 3.5 Task-II: Evaluating the Proposed Method for Cross-Speaker Emotion Transfer 3.6 Analysis of Manually Assigning Emotion Strengths for Both Task-I and Task-II 4 Conclusions References

Violence Detection Through Fusing Visual Information to Auditory Scene
    1 Introduction 2 Methods 2.1 CNN-ConvLSTM Model 2.2 Attention Module 2.3 Audio-Visual Information Fusion 3 Experiments and Results 3.1 Datasets 3.2 Audio Violence Detection 3.3 Audio-Visual Violence Detection 4 Conclusion References

Mongolian Text-to-Speech Challenge Under Low-Resource Scenario for NCMMSC2022
    1 Introduction 2 Voices to Build 2.1 Speech Dataset 2.2 Task 3 Participants 4 Evaluations and Results 4.1 Evaluation Materials 4.2 Evaluation Metrics 4.3 Results References

VC-AUG: Voice Conversion Based Data Augmentation for Text-Dependent Speaker Verification
    1 Introduction 2 Related Works 2.1 Speaker Verification System 2.2 Voice Conversion System 3 Methods 3.1 Pre-training and Fine-tuning 3.2 Data Augmentation Based on the VC System 3.3 Data Augmentation Based on the TTS System 3.4 Speaker Augmentation Based on Speed Perturbation 4 Experimental Results 5 Conclusion References

Transformer-Based Potential Emotional Relation Mining Network for Emotion Recognition in Conversation
    1 Introduction 2 Related Work 2.1 Emotion Recognition in Conversation 3 Task Definition 4 Proposed Method 4.1 Utterance Feature Extraction 4.2 Emotion Extraction Module 4.3 PERformer Module 4.4 Emotion Classifier 4.5 Datasets 5 Experiments 5.1 Datasets 5.2 Implementation Details 5.3 Evaluation Metrics 5.4 Comparing Methods and Metrics 5.5 Compared with the State-of-the-art Method 5.6 Ablation Study 5.7 Analysis on Parameters 5.8 Error Study 6 Conclusion References

FastFoley: Non-autoregressive Foley Sound Generation Based on Visual Semantics
    1 Introduction 2 Method 2.1 Audio and Visual Features Extraction 2.2 Acoustic Model 3 Dataset 4 Experiments and Results 4.1 Implementation Details 4.2 Experiments and Results 5 Conclusion References

Structured Hierarchical Dialogue Policy with Graph Neural Networks
    1 Introduction 2 Related Work 3 Hierarchical Reinforcement Learning 4 ComNet 4.1 Composite Dialogue 4.2 Graph Construction 4.3 ComNet as Policy Network 5 Experiments 5.1 PyDial Benchmark 5.2 Implementation 5.3 Analysis 5.4 Transferability 6 Conclusion References

Deep Reinforcement Learning for On-line Dialogue State Tracking
    1 Introduction 2 Related Work 3 On-line DST via Interaction 3.1 Input and Output 3.2 Tracking Policy 3.3 Reward Signal 4 Implementation Detail 4.1 Auxiliary Polynomial Tracker 4.2 Tracking Agents 4.3 DDPG for Tracking Policy 5 Joint Training Process 6 Experiments 6.1 Dataset 6.2 Systems 6.3 DRL-based DST Evaluation 6.4 Joint Training Evaluation 7 Conclusion References

Dual Learning for Dialogue State Tracking
    1 Introduction 2 Tracker and Dual Task 2.1 Coarse-to-Fine State Tracker 2.2 Dual Task 3 Dual Learning for DST 4 Experiments 4.1 Dataset 4.2 Training Details 4.3 Baseline Methods 4.4 Results 5 Related Work 6 Conclusion References

Automatic Stress Annotation and Prediction for Expressive Mandarin TTS
    1 Introduction 2 Methodology 2.1 Proposed Method for Stress Detection 2.2 Textual-Level Stress Prediction 2.3 Modeling Stress in Acoustic Model 3 Experiments 3.1 Complete Stress-Controllable TTS System 3.2 Experimental Results 4 Conclusion References

MnTTS2: An Open-Source Multi-speaker Mongolian Text-to-Speech Synthesis Dataset
    1 Introduction 2 Related Work 3 MnTTS2 Dataset 3.1 MnTTS 3.2 MnTTS2 4 Speech Synthesis Experiments 4.1 Experimental Setup 4.2 Naturalness Evaluation 4.3 Speaker Similarity Evaluation 5 Challenges and Future Work 6 Conclusion References

Author Index