

Download the book: Multimodal scene understanding

Book Details

Multimodal scene understanding

Edition:
Authors:
Series:
ISBN: 9780128173589
Publisher: Academic Press
Publication year: 2019
Number of pages: 419
Language: English
File format: PDF (can be converted to PDF, EPUB, or AZW3 on request)
File size: 9 MB

Book price (Toman): 42,000



Average rating for this book:
Number of raters: 22


If you would like the file of Multimodal scene understanding converted to PDF, EPUB, AZW3, MOBI, or DJVU format, you can notify support and they will convert the file for you.

Note that this book is the original-language edition and is not a Persian translation. The International Library website offers original-language books only and does not provide any books translated into or written in Persian.


Description of the book (in the original language)



Table of Contents

Cover
Multimodal Scene Understanding: Algorithms, Applications and Deep Learning
Copyright
Contents
List of Contributors
1 Introduction to Multimodal Scene Understanding
	1.1 Introduction
	1.2 Organization of the Book
	References
2 Deep Learning for Multimodal Data Fusion
	2.1 Introduction
	2.2 Related Work
	2.3 Basics of Multimodal Deep Learning: VAEs and GANs
		2.3.1 Auto-Encoder
		2.3.2 Variational Auto-Encoder (VAE)
		2.3.3 Generative Adversarial Network (GAN)
		2.3.4 VAE-GAN
		2.3.5 Adversarial Auto-Encoder (AAE)
		2.3.6 Adversarial Variational Bayes (AVB)
		2.3.7 ALI and BiGAN
	2.4 Multimodal Image-to-Image Translation Networks
		2.4.1 Pix2pix and Pix2pixHD
		2.4.2 CycleGAN, DiscoGAN, and DualGAN
		2.4.3 CoGAN
		2.4.4 UNIT
		2.4.5 Triangle GAN
	2.5 Multimodal Encoder-Decoder Networks
		2.5.1 Model Architecture
		2.5.2 Multitask Training
		2.5.3 Implementation Details
	2.6 Experiments
		2.6.1 Results on NYUDv2 Dataset
		2.6.2 Results on Cityscape Dataset
		2.6.3 Auxiliary Tasks
	2.7 Conclusion
	References
3 Multimodal Semantic Segmentation: Fusion of RGB and Depth Data in Convolutional Neural Networks
	3.1 Introduction
	3.2 Overview
		3.2.1 Image Classification and the VGG Network
		3.2.2 Architectures for Pixel-level Labeling
		3.2.3 Architectures for RGB and Depth Fusion
		3.2.4 Datasets and Benchmarks
	3.3 Methods
		3.3.1 Datasets and Data Splitting
		3.3.2 Preprocessing of the Stanford Dataset
		3.3.3 Preprocessing of the ISPRS Dataset
		3.3.4 One-channel Normal Label Representation
		3.3.5 Color Spaces for RGB and Depth Fusion
		3.3.6 Hyper-parameters and Training
	3.4 Results and Discussion
		3.4.1 Results and Discussion on the Stanford Dataset
		3.4.2 Results and Discussion on the ISPRS Dataset
	3.5 Conclusion
	References
4 Learning Convolutional Neural Networks for Object Detection with Very Little Training Data
	4.1 Introduction
	4.2 Fundamentals
		4.2.1 Types of Learning
		4.2.2 Convolutional Neural Networks
			4.2.2.1 Artificial neuron
			4.2.2.2 Artificial neural network
			4.2.2.3 Training
			4.2.2.4 Convolutional neural networks
		4.2.3 Random Forests
			4.2.3.1 Decision tree
			4.2.3.2 Random forest
	4.3 Related Work
	4.4 Traffic Sign Detection
		4.4.1 Feature Learning
		4.4.2 Random Forest Classification
		4.4.3 RF to NN Mapping
		4.4.4 Fully Convolutional Network
		4.4.5 Bounding Box Prediction
	4.5 Localization
	4.6 Clustering
	4.7 Dataset
		4.7.1 Data Capturing
		4.7.2 Filtering
	4.8 Experiments
		4.8.1 Training and Test Data
		4.8.2 Classification
		4.8.3 Object Detection
		4.8.4 Computation Time
		4.8.5 Precision of Localizations
	4.9 Conclusion
	Acknowledgment
	References
5 Multimodal Fusion Architectures for Pedestrian Detection
	5.1 Introduction
	5.2 Related Work
		5.2.1 Visible Pedestrian Detection
		5.2.2 Infrared Pedestrian Detection
		5.2.3 Multimodal Pedestrian Detection
	5.3 Proposed Method
		5.3.1 Multimodal Feature Learning/Fusion
		5.3.2 Multimodal Pedestrian Detection
			5.3.2.1 Baseline DNN model
			5.3.2.2 Scene-aware DNN model
		5.3.3 Multimodal Segmentation Supervision
	5.4 Experimental Results and Discussion
		5.4.1 Dataset and Evaluation Metric
		5.4.2 Implementation Details
		5.4.3 Evaluation of Multimodal Feature Fusion
		5.4.4 Evaluation of Multimodal Pedestrian Detection Networks
		5.4.5 Evaluation of Multimodal Segmentation Supervision Networks
		5.4.6 Comparison with State-of-the-Art Multimodal Pedestrian Detection Methods
	5.5 Conclusion
	Acknowledgment
	References
6 Multispectral Person Re-Identification Using GAN for Color-to-Thermal Image Translation
	6.1 Introduction
	6.2 Related Work
		6.2.1 Person Re-Identification
		6.2.2 Color-to-Thermal Translation
		6.2.3 Generative Adversarial Networks
	6.3 ThermalWorld Dataset
		6.3.1 ThermalWorld ReID Split
		6.3.2 ThermalWorld VOC Split
		6.3.3 Dataset Annotation
		6.3.4 Comparison of the ThermalWorld VOC Split with Previous Datasets
		6.3.5 Dataset Structure
		6.3.6 Data Processing
	6.4 Method
		6.4.1 Conditional Adversarial Networks
		6.4.2 Thermal Segmentation Generator
		6.4.3 Relative Thermal Contrast Generator
		6.4.4 Thermal Signature Matching
	6.5 Evaluation
		6.5.1 Network Training
		6.5.2 Color-to-Thermal Translation
			6.5.2.1 Qualitative comparison
			6.5.2.2 Quantitative evaluation
		6.5.3 ReID Evaluation Protocol
		6.5.4 Cross-modality ReID Baselines
		6.5.5 Comparison and Analysis
		6.5.6 Applications
	6.6 Conclusion
	Acknowledgments
	References
7 A Review and Quantitative Evaluation of Direct Visual-Inertial Odometry
	7.1 Introduction
	7.2 Related Work
		7.2.1 Visual Odometry
		7.2.2 Visual-Inertial Odometry
	7.3 Background: Nonlinear Optimization and Lie Groups
		7.3.1 Gauss-Newton Algorithm
	7.3.2 Levenberg-Marquardt Algorithm
	7.4 Background: Direct Sparse Odometry
		7.4.1 Notation
		7.4.2 Photometric Error
		7.4.3 Interaction Between Coarse Tracking and Joint Optimization
		7.4.4 Coarse Tracking Using Direct Image Alignment
		7.4.5 Joint Optimization
	7.5 Direct Sparse Visual-Inertial Odometry
		7.5.1 Inertial Error
		7.5.2 IMU Initialization and the Problem of Observability
		7.5.3 SIM(3)-based Model
		7.5.4 Scale-Aware Visual-Inertial Optimization
			7.5.4.1 Nonlinear optimization
			7.5.4.2 Marginalization using the Schur complement
			7.5.4.3 Dynamic marginalization for delayed scale convergence
			7.5.4.4 Measuring scale convergence
		7.5.5 Coarse Visual-Inertial Tracking
	7.6 Calculating the Relative Jacobians
		7.6.1 Proof of the Chain Rule
		7.6.2 Derivation of the Jacobian with Respect to Pose in Eq. (7.58)
		7.6.3 Derivation of the Jacobian with Respect to Scale and Gravity Direction in Eq. (7.59)
	7.7 Results
		7.7.1 Robust Quantitative Evaluation
		7.7.2 Evaluation of the Initialization
		7.7.3 Parameter Studies
	7.8 Conclusion
	References
8 Multimodal Localization for Embedded Systems: A Survey
	8.1 Introduction
	8.2 Positioning Systems and Perception Sensors
		8.2.1 Positioning Systems
			8.2.1.1 Inertial navigation systems
			8.2.1.2 Global navigation satellite systems
		8.2.2 Perception Sensors
			8.2.2.1 Visible light cameras
			8.2.2.2 IR cameras
			8.2.2.3 Event-based cameras
			8.2.2.4 RGB-D cameras
			8.2.2.5 LiDAR sensors
		8.2.3 Heterogeneous Sensor Data Fusion Methods
			8.2.3.1 Sensor configuration types
			8.2.3.2 Sensor coupling approaches
			8.2.3.3 Sensor fusion architectures
		8.2.4 Discussion
	8.3 State of the Art on Localization Methods
		8.3.1 Monomodal Localization
			8.3.1.1 INS-based localization
			8.3.1.2 GNSS-based localization
			8.3.1.3 Image-based localization
			8.3.1.4 LiDAR-map based localization
		8.3.2 Multimodal Localization
			8.3.2.1 Classical data fusion algorithms
			8.3.2.2 Reference multimodal benchmarks
			8.3.2.3 A panorama of multimodal localization approaches
			8.3.2.4 Graph-based localization
		8.3.3 Discussion
	8.4 Multimodal Localization for Embedded Systems
		8.4.1 Application Domain and Hardware Constraints
		8.4.2 Embedded Computing Architectures
			8.4.2.1 SoC constraints
			8.4.2.2 IP modules for SoC
			8.4.2.3 SoC
			8.4.2.4 FPGA
			8.4.2.5 ASIC
			8.4.2.6 Discussion
		8.4.3 Multimodal Localization in State-of-the-Art Embedded Systems
			8.4.3.1 Example of embedded SoC for multimodal localization
			8.4.3.2 Smart phones
			8.4.3.3 Smart glasses
			8.4.3.4 Autonomous mobile robots
			8.4.3.5 Unmanned aerial vehicles
			8.4.3.6 Autonomous driving vehicles
		8.4.4 Discussion
	8.5 Application Domains
		8.5.1 Scene Mapping
			8.5.1.1 Aircraft inspection
			8.5.1.2 SenseFly eBee classic
		8.5.2 Pedestrian Localization
			8.5.2.1 Indoor localization in large-scale buildings
			8.5.2.2 Precise localization of mobile devices in unknown environments
		8.5.3 Automotive Navigation
			8.5.3.1 Autonomous driving
			8.5.3.2 Smart factory
		8.5.4 Mixed Reality
			8.5.4.1 Virtual cane system for visually impaired individuals
			8.5.4.2 Engineering, construction and maintenance
	8.6 Conclusion
	References
9 Self-Supervised Learning from Web Data for Multimodal Retrieval
	9.1 Introduction
		9.1.1 Annotating Data: A Bottleneck for Training Deep Neural Networks
		9.1.2 Alternatives to Annotated Data
		9.1.3 Exploiting Multimodal Web Data
	9.2 Related Work
		9.2.1 Contributions
	9.3 Multimodal Text-Image Embedding
	9.4 Text Embeddings
	9.5 Benchmarks
		9.5.1 InstaCities1M
		9.5.2 WebVision
		9.5.3 MIRFlickr
	9.6 Retrieval on InstaCities1M and WebVision Datasets
		9.6.1 Experiment Setup
		9.6.2 Results and Conclusions
		9.6.3 Error Analysis
			9.6.3.1 Visual features confusion
			9.6.3.2 Errors from the dataset statistics
			9.6.3.3 Words with different meanings or uses
	9.7 Retrieval in the MIRFlickr Dataset
		9.7.1 Experiment Setup
		9.7.2 Results and Conclusions
	9.8 Comparing the Image and Text Embeddings
		9.8.1 Experiment Setup
		9.8.2 Results and Conclusions
	9.9 Visualizing CNN Activation Maps
	9.10 Visualizing the Learned Semantic Space with t-SNE
		9.10.1 Dimensionality Reduction with t-SNE
		9.10.2 Visualizing Both Image and Text Embeddings
		9.10.3 Showing Images at the Embedding Locations
		9.10.4 Semantic Space Inspection
	9.11 Conclusions
	Acknowledgments
	References
10 3D Urban Scene Reconstruction and Interpretation from Multisensor Imagery
	10.1 Introduction
	10.2 Pose Estimation for Wide-Baseline Image Sets
		10.2.1 Pose Estimation for Wide-Baseline Pairs and Triplets
		10.2.2 Hierarchical Merging of Triplets
		10.2.3 Automatic Determination of Overlap
	10.3 Dense 3D Reconstruction
		10.3.1 Dense Depth Map Generation and Uncertainty Estimation
		10.3.2 3D Uncertainty Propagation and 3D Reconstruction
	10.4 Scene Classification
		10.4.1 Relative Features
			10.4.1.1 Color coherence
			10.4.1.2 Definition of neighborhood
			10.4.1.3 Relative height
			10.4.1.4 Coplanarity of 3D points
		10.4.2 Classification and Results
			10.4.2.1 Post-processing
			10.4.2.2 Results for Bonnland
	10.5 Scene and Building Decomposition
		10.5.1 Scene Decomposition
		10.5.2 Building Decomposition
			10.5.2.1 Ridge extraction
			10.5.2.2 Primitive-based building decomposition
	10.6 Building Modeling
		10.6.1 Primitive Selection and Optimization
		10.6.2 Primitive Assembly
		10.6.3 LoD2 Models
		10.6.4 Detection of Facade Elements
		10.6.5 Shell Model
	10.7 Conclusion and Future Work
	References
11 Decision Fusion of Remote-Sensing Data for Land Cover Classification
	11.1 Introduction
		11.1.1 Review of the Main Data Fusion Methods
			11.1.1.1 Early fusion - fusion at the observation level
			11.1.1.2 Intermediate fusion - fusion at the attribute/feature level
			11.1.1.3 Late fusion - fusion at the decision level
		11.1.2 Discussion and Proposal of a Strategy
	11.2 Proposed Framework
		11.2.1 Fusion Rules
			11.2.1.1 Fuzzy rules
			11.2.1.2 Bayesian combination and majority vote
			11.2.1.3 Margin-based rules
			11.2.1.4 Dempster-Shafer evidence theory
			11.2.1.5 Supervised fusion rules: learning based approaches
		11.2.2 Global Regularization
			11.2.2.1 Model formulation(s)
			11.2.2.2 Optimization
			11.2.2.3 Parameter tuning
	11.3 Use Case #1: Hyperspectral and Very High Resolution Multispectral Imagery for Urban Material Discrimination
		11.3.1 Introduction
		11.3.2 Fusion Process
		11.3.3 Datasets
		11.3.4 Results and Discussion
			11.3.4.1 Source comparison
			11.3.4.2 Decision fusion classification
			11.3.4.3 Regularization
		11.3.5 Conclusion
	11.4 Use Case #2: Urban Footprint Detection
		11.4.1 Introduction
		11.4.2 Proposed Framework: A Two-Step Urban Footprint Detection
			11.4.2.1 Initial classifications
			11.4.2.2 First regularization
			11.4.2.3 Binary classification and fusion
		11.4.3 Data
		11.4.4 Results
			11.4.4.1 Five-class classifications
			11.4.4.2 Urban footprint extraction
		11.4.5 Conclusion
	11.5 Final Outlook and Perspectives
	References
12 Cross-modal Learning by Hallucinating Missing Modalities in RGB-D Vision
	12.1 Introduction
	12.2 Related Work
		12.2.1 Generalized Distillation
		12.2.2 Multimodal Video Action Recognition
	12.3 Generalized Distillation with Multiple Stream Networks
		12.3.1 Cross-stream Multiplier Networks
		12.3.2 Hallucination Stream
		12.3.3 Training Paradigm
	12.4 Experiments
		12.4.1 Datasets
		12.4.2 Pre-processing and Alignment of RGB and Depth Frames
		12.4.3 Hyperparameters and Validation Set
		12.4.4 Ablation Study
			12.4.4.1 Contribution of the cross-stream connections
			12.4.4.2 Contributions of the proposed distillation loss (Eq. (12.5))
			12.4.4.3 Contributions of the proposed training procedure
		12.4.5 Inference with Noisy Depth
		12.4.6 Comparison with Other Methods
		12.4.7 Inverting Modalities - RGB Distillation
	12.5 Conclusions and Future Work
	References
Index
Back Cover



