Edition:
Authors: Yang M.Y. (ed.)
Series:
ISBN: 9780128173589
Publisher: Academic Press
Publication year: 2019
Number of pages: 419
Language: English
File format: PDF (can be converted to EPUB or AZW3 at the user's request)
File size: 9 MB
If you need the file of Multimodal Scene Understanding converted to PDF, EPUB, AZW3, MOBI, or DJVU, you can notify support and they will convert the file for you.
Please note that Multimodal Scene Understanding is the original-language edition and is not a Persian translation. The International Library website provides original-language books only and does not offer any books translated into or written in Persian.
Cover
Multimodal Scene Understanding: Algorithms, Applications and Deep Learning
Copyright
Contents
List of Contributors
1 Introduction to Multimodal Scene Understanding
  1.1 Introduction
  1.2 Organization of the Book
  References
2 Deep Learning for Multimodal Data Fusion
  2.1 Introduction
  2.2 Related Work
  2.3 Basics of Multimodal Deep Learning: VAEs and GANs
    2.3.1 Auto-Encoder
    2.3.2 Variational Auto-Encoder (VAE)
    2.3.3 Generative Adversarial Network (GAN)
    2.3.4 VAE-GAN
    2.3.5 Adversarial Auto-Encoder (AAE)
    2.3.6 Adversarial Variational Bayes (AVB)
    2.3.7 ALI and BiGAN
  2.4 Multimodal Image-to-Image Translation Networks
    2.4.1 Pix2pix and Pix2pixHD
    2.4.2 CycleGAN, DiscoGAN, and DualGAN
    2.4.3 CoGAN
    2.4.4 UNIT
    2.4.5 Triangle GAN
  2.5 Multimodal Encoder-Decoder Networks
    2.5.1 Model Architecture
    2.5.2 Multitask Training
    2.5.3 Implementation Details
  2.6 Experiments
    2.6.1 Results on NYUDv2 Dataset
    2.6.2 Results on Cityscape Dataset
    2.6.3 Auxiliary Tasks
  2.7 Conclusion
  References
3 Multimodal Semantic Segmentation: Fusion of RGB and Depth Data in Convolutional Neural Networks
  3.1 Introduction
  3.2 Overview
    3.2.1 Image Classification and the VGG Network
    3.2.2 Architectures for Pixel-level Labeling
    3.2.3 Architectures for RGB and Depth Fusion
    3.2.4 Datasets and Benchmarks
  3.3 Methods
    3.3.1 Datasets and Data Splitting
    3.3.2 Preprocessing of the Stanford Dataset
    3.3.3 Preprocessing of the ISPRS Dataset
    3.3.4 One-channel Normal Label Representation
    3.3.5 Color Spaces for RGB and Depth Fusion
    3.3.6 Hyper-parameters and Training
  3.4 Results and Discussion
    3.4.1 Results and Discussion on the Stanford Dataset
    3.4.2 Results and Discussion on the ISPRS Dataset
  3.5 Conclusion
  References
4 Learning Convolutional Neural Networks for Object Detection with Very Little Training Data
  4.1 Introduction
  4.2 Fundamentals
    4.2.1 Types of Learning
    4.2.2 Convolutional Neural Networks
      4.2.2.1 Artificial neuron
      4.2.2.2 Artificial neural network
      4.2.2.3 Training
      4.2.2.4 Convolutional neural networks
    4.2.3 Random Forests
      4.2.3.1 Decision tree
      4.2.3.2 Random forest
  4.3 Related Work
  4.4 Traffic Sign Detection
    4.4.1 Feature Learning
    4.4.2 Random Forest Classification
    4.4.3 RF to NN Mapping
    4.4.4 Fully Convolutional Network
    4.4.5 Bounding Box Prediction
  4.5 Localization
  4.6 Clustering
  4.7 Dataset
    4.7.1 Data Capturing
    4.7.2 Filtering
  4.8 Experiments
    4.8.1 Training and Test Data
    4.8.2 Classification
    4.8.3 Object Detection
    4.8.4 Computation Time
    4.8.5 Precision of Localizations
  4.9 Conclusion
  Acknowledgment
  References
5 Multimodal Fusion Architectures for Pedestrian Detection
  5.1 Introduction
  5.2 Related Work
    5.2.1 Visible Pedestrian Detection
    5.2.2 Infrared Pedestrian Detection
    5.2.3 Multimodal Pedestrian Detection
  5.3 Proposed Method
    5.3.1 Multimodal Feature Learning/Fusion
    5.3.2 Multimodal Pedestrian Detection
      5.3.2.1 Baseline DNN model
      5.3.2.2 Scene-aware DNN model
    5.3.3 Multimodal Segmentation Supervision
  5.4 Experimental Results and Discussion
    5.4.1 Dataset and Evaluation Metric
    5.4.2 Implementation Details
    5.4.3 Evaluation of Multimodal Feature Fusion
    5.4.4 Evaluation of Multimodal Pedestrian Detection Networks
    5.4.5 Evaluation of Multimodal Segmentation Supervision Networks
    5.4.6 Comparison with State-of-the-Art Multimodal Pedestrian Detection Methods
  5.5 Conclusion
  Acknowledgment
  References
6 Multispectral Person Re-Identification Using GAN for Color-to-Thermal Image Translation
  6.1 Introduction
  6.2 Related Work
    6.2.1 Person Re-Identification
    6.2.2 Color-to-Thermal Translation
    6.2.3 Generative Adversarial Networks
  6.3 ThermalWorld Dataset
    6.3.1 ThermalWorld ReID Split
    6.3.2 ThermalWorld VOC Split
    6.3.3 Dataset Annotation
    6.3.4 Comparison of the ThermalWorld VOC Split with Previous Datasets
    6.3.5 Dataset Structure
    6.3.6 Data Processing
  6.4 Method
    6.4.1 Conditional Adversarial Networks
    6.4.2 Thermal Segmentation Generator
    6.4.3 Relative Thermal Contrast Generator
    6.4.4 Thermal Signature Matching
  6.5 Evaluation
    6.5.1 Network Training
    6.5.2 Color-to-Thermal Translation
      6.5.2.1 Qualitative comparison
      6.5.2.2 Quantitative evaluation
    6.5.3 ReID Evaluation Protocol
    6.5.4 Cross-modality ReID Baselines
    6.5.5 Comparison and Analysis
    6.5.6 Applications
  6.6 Conclusion
  Acknowledgments
  References
7 A Review and Quantitative Evaluation of Direct Visual-Inertial Odometry
  7.1 Introduction
  7.2 Related Work
    7.2.1 Visual Odometry
    7.2.2 Visual-Inertial Odometry
  7.3 Background: Nonlinear Optimization and Lie Groups
    7.3.1 Gauss-Newton Algorithm
    7.3.2 Levenberg-Marquardt Algorithm
  7.4 Background: Direct Sparse Odometry
    7.4.1 Notation
    7.4.2 Photometric Error
    7.4.3 Interaction Between Coarse Tracking and Joint Optimization
    7.4.4 Coarse Tracking Using Direct Image Alignment
    7.4.5 Joint Optimization
  7.5 Direct Sparse Visual-Inertial Odometry
    7.5.1 Inertial Error
    7.5.2 IMU Initialization and the Problem of Observability
    7.5.3 SIM(3)-based Model
    7.5.4 Scale-Aware Visual-Inertial Optimization
      7.5.4.1 Nonlinear optimization
      7.5.4.2 Marginalization using the Schur complement
      7.5.4.3 Dynamic marginalization for delayed scale convergence
      7.5.4.4 Measuring scale convergence
    7.5.5 Coarse Visual-Inertial Tracking
  7.6 Calculating the Relative Jacobians
    7.6.1 Proof of the Chain Rule
    7.6.2 Derivation of the Jacobian with Respect to Pose in Eq. (7.58)
    7.6.3 Derivation of the Jacobian with Respect to Scale and Gravity Direction in Eq. (7.59)
  7.7 Results
    7.7.1 Robust Quantitative Evaluation
    7.7.2 Evaluation of the Initialization
    7.7.3 Parameter Studies
  7.8 Conclusion
  References
8 Multimodal Localization for Embedded Systems: A Survey
  8.1 Introduction
  8.2 Positioning Systems and Perception Sensors
    8.2.1 Positioning Systems
      8.2.1.1 Inertial navigation systems
      8.2.1.2 Global navigation satellite systems
    8.2.2 Perception Sensors
      8.2.2.1 Visible light cameras
      8.2.2.2 IR cameras
      8.2.2.3 Event-based cameras
      8.2.2.4 RGB-D cameras
      8.2.2.5 LiDAR sensors
    8.2.3 Heterogeneous Sensor Data Fusion Methods
      8.2.3.1 Sensor configuration types
      8.2.3.2 Sensor coupling approaches
      8.2.3.3 Sensors fusion architectures
    8.2.4 Discussion
  8.3 State of the Art on Localization Methods
    8.3.1 Monomodal Localization
      8.3.1.1 INS-based localization
      8.3.1.2 GNSS-based localization
      8.3.1.3 Image-based localization
      8.3.1.4 LiDAR-map based localization
    8.3.2 Multimodal Localization
      8.3.2.1 Classical data fusion algorithms
      8.3.2.2 Reference multimodal benchmarks
      8.3.2.3 A panorama of multimodal localization approaches
      8.3.2.4 Graph-based localization
    8.3.3 Discussion
  8.4 Multimodal Localization for Embedded Systems
    8.4.1 Application Domain and Hardware Constraints
    8.4.2 Embedded Computing Architectures
      8.4.2.1 SoC constraints
      8.4.2.2 IP modules for SoC
      8.4.2.3 SoC
      8.4.2.4 FPGA
      8.4.2.5 ASIC
      8.4.2.6 Discussion
    8.4.3 Multimodal Localization in State-of-the-Art Embedded Systems
      8.4.3.1 Example of embedded SoC for multimodal localization
      8.4.3.2 Smart phones
      8.4.3.3 Smart glasses
      8.4.3.4 Autonomous mobile robots
      8.4.3.5 Unmanned aerial vehicles
      8.4.3.6 Autonomous driving vehicles
    8.4.4 Discussion
  8.5 Application Domains
    8.5.1 Scene Mapping
      8.5.1.1 Aircraft inspection
      8.5.1.2 SenseFly eBee classic
    8.5.2 Pedestrian Localization
      8.5.2.1 Indoor localization in large-scale buildings
      8.5.2.2 Precise localization of mobile devices in unknown environments
    8.5.3 Automotive Navigation
      8.5.3.1 Autonomous driving
      8.5.3.2 Smart factory
    8.5.4 Mixed Reality
      8.5.4.1 Virtual cane system for visually impaired individuals
      8.5.4.2 Engineering, construction and maintenance
  8.6 Conclusion
  References
9 Self-Supervised Learning from Web Data for Multimodal Retrieval
  9.1 Introduction
    9.1.1 Annotating Data: A Bottleneck for Training Deep Neural Networks
    9.1.2 Alternatives to Annotated Data
    9.1.3 Exploiting Multimodal Web Data
  9.2 Related Work
    9.2.1 Contributions
  9.3 Multimodal Text-Image Embedding
  9.4 Text Embeddings
  9.5 Benchmarks
    9.5.1 InstaCities1M
    9.5.2 WebVision
    9.5.3 MIRFlickr
  9.6 Retrieval on InstaCities1M and WebVision Datasets
    9.6.1 Experiment Setup
    9.6.2 Results and Conclusions
    9.6.3 Error Analysis
      9.6.3.1 Visual features confusion
      9.6.3.2 Errors from the dataset statistics
      9.6.3.3 Words with different meanings or uses
  9.7 Retrieval in the MIRFlickr Dataset
    9.7.1 Experiment Setup
    9.7.2 Results and Conclusions
  9.8 Comparing the Image and Text Embeddings
    9.8.1 Experiment Setup
    9.8.2 Results and Conclusions
  9.9 Visualizing CNN Activation Maps
  9.10 Visualizing the Learned Semantic Space with t-SNE
    9.10.1 Dimensionality Reduction with t-SNE
    9.10.2 Visualizing Both Image and Text Embeddings
    9.10.3 Showing Images at the Embedding Locations
    9.10.4 Semantic Space Inspection
  9.11 Conclusions
  Acknowledgments
  References
10 3D Urban Scene Reconstruction and Interpretation from Multisensor Imagery
  10.1 Introduction
  10.2 Pose Estimation for Wide-Baseline Image Sets
    10.2.1 Pose Estimation for Wide-Baseline Pairs and Triplets
    10.2.2 Hierarchical Merging of Triplets
    10.2.3 Automatic Determination of Overlap
  10.3 Dense 3D Reconstruction
    10.3.1 Dense Depth Map Generation and Uncertainty Estimation
    10.3.2 3D Uncertainty Propagation and 3D Reconstruction
  10.4 Scene Classification
    10.4.1 Relative Features
      10.4.1.1 Color coherence
      10.4.1.2 Definition of neighborhood
      10.4.1.3 Relative height
      10.4.1.4 Coplanarity of 3D points
    10.4.2 Classification and Results
      10.4.2.1 Post-processing
      10.4.2.2 Results for Bonnland
  10.5 Scene and Building Decomposition
    10.5.1 Scene Decomposition
    10.5.2 Building Decomposition
      10.5.2.1 Ridge extraction
      10.5.2.2 Primitive-based building decomposition
  10.6 Building Modeling
    10.6.1 Primitive Selection and Optimization
    10.6.2 Primitive Assembly
    10.6.3 LoD2 Models
    10.6.4 Detection of Facade Elements
    10.6.5 Shell Model
  10.7 Conclusion and Future Work
  References
11 Decision Fusion of Remote-Sensing Data for Land Cover Classification
  11.1 Introduction
    11.1.1 Review of the Main Data Fusion Methods
      11.1.1.1 Early fusion - fusion at the observation level
      11.1.1.2 Intermediate fusion - fusion at the attribute/feature level
      11.1.1.3 Late fusion - fusion at the decision level
    11.1.2 Discussion and Proposal of a Strategy
  11.2 Proposed Framework
    11.2.1 Fusion Rules
      11.2.1.1 Fuzzy rules
      11.2.1.2 Bayesian combination and majority vote
      11.2.1.3 Margin-based rules
      11.2.1.4 Dempster-Shafer evidence theory
      11.2.1.5 Supervised fusion rules: learning based approaches
    11.2.2 Global Regularization
      11.2.2.1 Model formulation(s)
      11.2.2.2 Optimization
      11.2.2.3 Parameter tuning
  11.3 Use Case #1: Hyperspectral and Very High Resolution Multispectral Imagery for Urban Material Discrimination
    11.3.1 Introduction
    11.3.2 Fusion Process
    11.3.3 Datasets
    11.3.4 Results and Discussion
      11.3.4.1 Source comparison
      11.3.4.2 Decision fusion classification
      11.3.4.3 Regularization
    11.3.5 Conclusion
  11.4 Use Case #2: Urban Footprint Detection
    11.4.1 Introduction
    11.4.2 Proposed Framework: A Two-Step Urban Footprint Detection
      11.4.2.1 Initial classifications
      11.4.2.2 First regularization
      11.4.2.3 Binary classification and fusion
    11.4.3 Data
    11.4.4 Results
      11.4.4.1 Five-class classifications
      11.4.4.2 Urban footprint extraction
    11.4.5 Conclusion
  11.5 Final Outlook and Perspectives
  References
12 Cross-modal Learning by Hallucinating Missing Modalities in RGB-D Vision
  12.1 Introduction
  12.2 Related Work
    12.2.1 Generalized Distillation
    12.2.2 Multimodal Video Action Recognition
  12.3 Generalized Distillation with Multiple Stream Networks
    12.3.1 Cross-stream Multiplier Networks
    12.3.2 Hallucination Stream
    12.3.3 Training Paradigm
  12.4 Experiments
    12.4.1 Datasets
    12.4.2 Pre-processing and Alignment of RGB and Depth Frames
    12.4.3 Hyperparameters and Validation Set
    12.4.4 Ablation Study
      12.4.4.1 Contribution of the cross-stream connections
      12.4.4.2 Contributions of the proposed distillation loss (Eq. (12.5))
      12.4.4.3 Contributions of the proposed training procedure
    12.4.5 Inference with Noisy Depth
    12.4.6 Comparison with Other Methods
    12.4.7 Inverting Modalities - RGB Distillation
  12.5 Conclusions and Future Work
  References
Index
Back Cover