Edition:
Author: Ivo D. Dinov
Series:
ISBN: 9783319723471
Publisher: Springer
Publication year: 2018
Number of pages: 848
Language: English
File format: PDF (can be converted to EPUB or AZW3 on request)
File size: 32 MB
Over the past decade, Big Data have become ubiquitous in all economic sectors, scientific disciplines, and human activities. They have led to striking technological advances, affecting all human experiences. Our ability to manage, understand, interrogate, and interpret such extremely large, multisource, heterogeneous, incomplete, multiscale, and incongruent data has not kept pace with the rapid increase in the volume, complexity, and proliferation of the deluge of digital information. There are three reasons for this shortfall. First, the volume of data is increasing much faster than the corresponding rise in our computational processing power (Kryder's law > Moore's law). Second, traditional discipline boundaries inhibit expeditious progress. Third, our education and training activities have fallen behind the accelerated pace of scientific, information, and communication advances. There are very few rigorous instructional resources, interactive learning materials, and dynamic training environments that support active data science learning. This textbook balances the mathematical foundations with dexterous demonstrations and examples of data, tools, modules, and workflows that serve as pillars for the urgently needed bridge to close the supply-and-demand gap in predictive analytics skills. By exposing the enormous opportunities presented by the tsunami of Big Data, this textbook aims to identify specific knowledge gaps, educational barriers, and workforce readiness deficiencies. Specifically, it focuses on the development of a transdisciplinary curriculum integrating modern computational methods, advanced data science techniques, innovative biomedical applications, and impactful health analytics. The content of this graduate-level textbook fills a substantial gap in integrating modern engineering concepts, computational algorithms, mathematical optimization, statistical computing, and biomedical inference. Big Data analytic techniques and predictive scientific methods demand broad transdisciplinary knowledge, appeal to an extremely wide spectrum of readers/learners, and provide incredible opportunities for engagement throughout academia, industry, and regulatory and funding agencies. The two examples below demonstrate the powerful need for scientific knowledge, computational abilities, interdisciplinary expertise, and modern technologies necessary to achieve desired outcomes (improving human health and optimizing future return on investment). This can only be achieved by appropriately trained teams of researchers who can develop robust decision support systems using modern techniques and effective end-to-end protocols, like the ones described in this textbook.
• A geriatric neurologist is examining a patient complaining of gait imbalance and postural instability. To determine whether the patient may suffer from Parkinson's disease, the physician acquires clinical, cognitive, phenotypic, imaging, and genetics data (Big Data). Most clinics and healthcare centers are not equipped with skilled data analytics teams that can wrangle, harmonize, and interpret such complex datasets. A learner who completes a course of study using this textbook will have the competency and ability to manage the data, generate a protocol for deriving biomarkers, and provide an actionable decision support system. The results of this protocol will help the physician understand the entire patient dataset and assist in making a holistic, evidence-based, data-driven clinical diagnosis.
• To improve the return on investment for its shareholders, a healthcare manufacturer needs to forecast the demand for its product subject to environmental, demographic, economic, and bio-social sentiment data (Big Data). The organization's data analytics team is tasked with developing a protocol that identifies, aggregates, harmonizes, models, and analyzes these heterogeneous data elements to generate a trend forecast. This system needs to provide an automated, adaptive, scalable, and reliable prediction of the optimal investment, e.g., the R&D allocation, that maximizes the company's bottom line. A reader who completes a course of study using this textbook will be able to ingest the observed structured and unstructured data, mathematically represent the data as a computable object, and apply appropriate model-based and model-free prediction techniques. The results of these techniques may be used to forecast the expected relationship between the company's investment, product supply, and general healthcare demand (providers and patients), and to estimate the return on the initial investments.
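To make the "model-based and model-free" distinction concrete, here is a minimal, hypothetical R sketch of that workflow on a built-in toy dataset: ingest data, split it, fit a model-based predictor (linear regression) and a model-free predictor (k-nearest neighbours), and report simple performance summaries. The mtcars data, the class package, and all variable names are illustrative assumptions, not examples taken from the book.

# Minimal sketch (not from the textbook): model-based vs. model-free prediction in R.
library(class)   # provides knn(); an assumed, commonly available package

data(mtcars)
set.seed(1)
train_idx <- sample(seq_len(nrow(mtcars)), size = 24)   # ~75% training split
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# Model-based prediction: linear regression of fuel efficiency on weight and horsepower
lm_fit  <- lm(mpg ~ wt + hp, data = train)
lm_pred <- predict(lm_fit, newdata = test)
lm_rmse <- sqrt(mean((lm_pred - test$mpg)^2))

# Model-free prediction: k-NN classification of an "efficient" indicator
# (above/below the overall median mpg), using predictors scaled on the training set
train_x <- scale(train[, c("wt", "hp")])
test_x  <- scale(test[, c("wt", "hp")],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))
train_y <- factor(train$mpg > median(mtcars$mpg))
test_y  <- factor(test$mpg > median(mtcars$mpg), levels = levels(train_y))
knn_pred <- knn(train = train_x, test = test_x, cl = train_y, k = 3)
knn_acc  <- mean(knn_pred == test_y)

c(lm_rmse = lm_rmse, knn_accuracy = knn_acc)

The same pattern (split, fit, predict, evaluate) generalizes to the biomedical and health case studies listed in the table of contents below, where the predictors and performance measures are chosen to match the problem.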
Foreword
Preface
Genesis
Purpose
Limitations/Prerequisites
Scope of the Book
Acknowledgements
DSPA Application and Use Disclaimer
Biomedical, Biosocial, Environmental, and Health Disclaimer
Notations
Contents
Chapter 1: Motivation 1.1 DSPA Mission and Objectives 1.2 Examples of Driving Motivational Problems and Challenges 1.2.1 Alzheimer's Disease 1.2.2 Parkinson's Disease 1.2.3 Drug and Substance Use 1.2.4 Amyotrophic Lateral Sclerosis 1.2.5 Normal Brain Visualization 1.2.6 Neurodegeneration 1.2.7 Genetic Forensics: 2013-2016 Ebola Outbreak 1.2.8 Next Generation Sequence (NGS) Analysis 1.2.9 Neuroimaging-Genetics 1.3 Common Characteristics of Big (Biomedical and Health) Data 1.4 Data Science 1.5 Predictive Analytics 1.6 High-Throughput Big Data Analytics 1.7 Examples of Data Repositories, Archives, and Services 1.8 DSPA Expectations
Chapter 2: Foundations of R 2.1 Why Use R? 2.2 Getting Started 2.2.1 Install Basic Shell-Based R 2.2.2 GUI Based R Invocation (RStudio) 2.2.3 RStudio GUI Layout 2.2.4 Some Notes 2.3 Help 2.4 Simple Wide-to-Long Data Format Translation 2.5 Data Generation 2.6 Input/Output (I/O) 2.7 Slicing and Extracting Data 2.8 Variable Conversion 2.9 Variable Information 2.10 Data Selection and Manipulation 2.11 Math Functions 2.12 Matrix Operations 2.13 Advanced Data Processing 2.14 Strings 2.15 Plotting 2.16 QQ Normal Probability Plot 2.17 Low-Level Plotting Commands 2.18 Graphics Parameters 2.19 Optimization and Model Fitting 2.20 Statistics 2.21 Distributions 2.21.1 Programming 2.22 Data Simulation Primer 2.23 Appendix 2.23.1 HTML SOCR Data Import 2.23.2 R Debugging Example 2.24 Assignments: 2. R Foundations 2.24.1 Confirm that You Have Installed R/RStudio 2.24.2 Long-to-Wide Data Format Translation 2.24.3 Data Frames 2.24.4 Data Stratification 2.24.5 Simulation 2.24.6 Programming References
Chapter 3: Managing Data in R 3.1 Saving and Loading R Data Structures 3.2 Importing and Saving Data from CSV Files 3.3 Exploring the Structure of Data 3.4 Exploring Numeric Variables 3.5 Measuring the Central Tendency: Mean, Median, Mode 3.6 Measuring Spread: Quartiles and the Five-Number Summary 3.7 Visualizing Numeric Variables: Boxplots 3.8 Visualizing Numeric Variables: Histograms 3.9 Understanding Numeric Data: Uniform and Normal Distributions 3.10 Measuring Spread: Variance and Standard Deviation 3.11 Exploring Categorical Variables 3.12 Exploring Relationships Between Variables 3.13 Missing Data 3.13.1 Simulate Some Real Multivariate Data 3.13.2 TBI Data Example 3.13.3 Imputation via Expectation-Maximization (Types of Missing Data, General Idea of EM Algorithm, EM-Based Imputation, A Simple Manual Implementation of EM-Based Imputation, Plotting Complete and Imputed Data, Validation of EM-Imputation Using the Amelia R Package, Comparison, Density Plots) 3.14 Parsing Webpages and Visualizing Tabular HTML Data 3.15 Cohort-Rebalancing (for Imbalanced Groups) 3.16 Appendix 3.16.1 Importing Data from SQL Databases 3.16.2 R Code Fragments 3.17 Assignments: 3. Managing Data in R 3.17.1 Import, Plot, Summarize and Save Data 3.17.2 Explore some Bivariate Relations in the Data 3.17.3 Missing Data 3.17.4 Surface Plots 3.17.5 Unbalanced Designs 3.17.6 Aggregate Analysis References
Chapter 4: Data Visualization 4.1 Common Questions 4.2 Classification of Visualization Methods 4.3 Composition 4.3.1 Histograms and Density Plots 4.3.2 Pie Chart 4.3.3 Heat Map 4.4 Comparison 4.4.1 Paired Scatter Plots 4.4.2 Jitter Plot 4.4.3 Bar Plots 4.4.4 Trees and Graphs 4.4.5 Correlation Plots 4.5 Relationships 4.5.1 Line Plots Using ggplot 4.5.2 Density Plots 4.5.3 Distributions 4.5.4 2D Kernel Density and 3D Surface Plots 4.5.5 Multiple 2D Image Surface Plots 4.5.6 3D and 4D Visualizations 4.6 Appendix 4.6.1 Hands-on Activity (Health Behavior Risks) 4.6.2 Additional ggplot Examples (Housing Price Data, Modeling the Home Price Index Data, Map of the Neighborhoods of Los Angeles (LA), Latin Letter Frequency in Different Languages) 4.7 Assignments 4: Data Visualization 4.7.1 Common Plots 4.7.2 Trees and Graphs 4.7.3 Exploratory Data Analytics (EDA) References
Chapter 5: Linear Algebra and Matrix Computing 5.1 Matrices (Second Order Tensors) 5.1.1 Create Matrices 5.1.2 Adding Columns and Rows 5.2 Matrix Subscripts 5.3 Matrix Operations 5.3.1 Addition 5.3.2 Subtraction 5.3.3 Multiplication (Elementwise Multiplication, Matrix Multiplication) 5.3.4 Element-wise Division 5.3.5 Transpose 5.3.6 Multiplicative Inverse 5.4 Matrix Algebra Notation 5.4.1 Linear Models 5.4.2 Solving Systems of Equations 5.4.3 The Identity Matrix 5.5 Scalars, Vectors and Matrices 5.5.1 Sample Statistics (Mean, Variance, Applications of Matrix Algebra: Linear Modeling, Finding Function Extrema (Min/Max) Using Calculus) 5.5.2 Least Square Estimation (The R lm Function) 5.6 Eigenvalues and Eigenvectors 5.7 Other Important Functions 5.8 Matrix Notation (Another View) 5.9 Multivariate Linear Regression 5.10 Sample Covariance Matrix 5.11 Assignments: 5. Linear Algebra and Matrix Computing 5.11.1 How Is Matrix Multiplication Defined? 5.11.2 Scalar Versus Matrix Multiplication 5.11.3 Matrix Equations 5.11.4 Least Square Estimation 5.11.5 Matrix Manipulation 5.11.6 Matrix Transpose 5.11.7 Sample Statistics 5.11.8 Least Square Estimation 5.11.9 Eigenvalues and Eigenvectors References
Chapter 6: Dimensionality Reduction 6.1 Example: Reducing 2D to 1D 6.2 Matrix Rotations 6.3 Notation 6.4 Summary (PCA vs. ICA vs. FA) 6.5 Principal Component Analysis (PCA) 6.5.1 Principal Components 6.6 Independent Component Analysis (ICA) 6.7 Factor Analysis (FA) 6.8 Singular Value Decomposition (SVD) 6.9 SVD Summary 6.10 Case Study for Dimension Reduction (Parkinson's Disease) 6.11 Assignments: 6. Dimensionality Reduction 6.11.1 Parkinson's Disease Example 6.11.2 Allometric Relations in Plants Example (Load Data, Dimensionality Reduction) References
Chapter 7: Lazy Learning: Classification Using Nearest Neighbors 7.1 Motivation 7.2 The kNN Algorithm Overview 7.2.1 Distance Function and Dummy Coding 7.2.2 Ways to Determine k 7.2.3 Rescaling of the Features 7.2.4 Rescaling Formulas 7.3 Case Study 7.3.1 Step 1: Collecting Data 7.3.2 Step 2: Exploring and Preparing the Data 7.3.3 Normalizing Data 7.3.4 Data Preparation: Creating Training and Testing Datasets 7.3.5 Step 3: Training a Model On the Data 7.3.6 Step 4: Evaluating Model Performance 7.3.7 Step 5: Improving Model Performance 7.3.8 Testing Alternative Values of k 7.3.9 Quantitative Assessment (Tables 7.2 and 7.3) 7.4 Assignments: 7. Lazy Learning: Classification Using Nearest Neighbors 7.4.1 Traumatic Brain Injury (TBI) 7.4.2 Parkinson's Disease 7.4.3 KNN Classification in a High Dimensional Space 7.4.4 KNN Classification in a Lower Dimensional Space References
Chapter 8: Probabilistic Learning: Classification Using Naive Bayes 8.1 Overview of the Naive Bayes Algorithm 8.2 Assumptions 8.3 Bayes Formula 8.4 The Laplace Estimator 8.5 Case Study: Head and Neck Cancer Medication 8.5.1 Step 1: Collecting Data 8.5.2 Step 2: Exploring and Preparing the Data (Data Preparation: Processing Text Data for Analysis, Data Preparation: Creating Training and Test Datasets, Visualizing Text Data: Word Clouds, Data Preparation: Creating Indicator Features for Frequent Words) 8.5.3 Step 3: Training a Model on the Data 8.5.4 Step 4: Evaluating Model Performance 8.5.5 Step 5: Improving Model Performance 8.5.6 Step 6: Compare Naive Bayesian against LDA 8.6 Practice Problem 8.7 Assignments 8: Probabilistic Learning: Classification Using Naive Bayes 8.7.1 Explain These Two Concepts 8.7.2 Analyzing Textual Data References
Chapter 9: Decision Tree Divide and Conquer Classification 9.1 Motivation 9.2 Hands-on Example: Iris Data 9.3 Decision Tree Overview 9.3.1 Divide and Conquer 9.3.2 Entropy 9.3.3 Misclassification Error and Gini Index 9.3.4 C5.0 Decision Tree Algorithm 9.3.5 Pruning the Decision Tree 9.4 Case Study 1: Quality of Life and Chronic Disease 9.4.1 Step 1: Collecting Data 9.4.2 Step 2: Exploring and Preparing the Data (Data Preparation: Creating Random Training and Test Datasets) 9.4.3 Step 3: Training a Model On the Data 9.4.4 Step 4: Evaluating Model Performance 9.4.5 Step 5: Trial Option 9.4.6 Loading the Misclassification Error Matrix 9.4.7 Parameter Tuning 9.5 Compare Different Impurity Indices 9.6 Classification Rules 9.6.1 Separate and Conquer 9.6.2 The One Rule Algorithm 9.6.3 The RIPPER Algorithm 9.7 Case Study 2: QoL in Chronic Disease (Take 2) 9.7.1 Step 3: Training a Model on the Data 9.7.2 Step 4: Evaluating Model Performance 9.7.3 Step 5: Alternative Model1 9.7.4 Step 5: Alternative Model2 9.8 Practice Problem 9.9 Assignments 9: Decision Tree Divide and Conquer Classification 9.9.1 Explain These Concepts 9.9.2 Decision Tree Partitioning References
Chapter 10: Forecasting Numeric Data Using Regression Models 10.1 Understanding Regression 10.1.1 Simple Linear Regression 10.2 Ordinary Least Squares Estimation 10.2.1 Model Assumptions 10.2.2 Correlations 10.2.3 Multiple Linear Regression 10.3 Case Study 1: Baseball Players 10.3.1 Step 1: Collecting Data 10.3.2 Step 2: Exploring and Preparing the Data 10.3.3 Exploring Relationships Among Features: The Correlation Matrix 10.3.4 Visualizing Relationships Among Features: The Scatterplot Matrix 10.3.5 Step 3: Training a Model on the Data 10.3.6 Step 4: Evaluating Model Performance 10.4 Step 5: Improving Model Performance 10.4.1 Model Specification: Adding Non-linear Relationships 10.4.2 Transformation: Converting a Numeric Variable to a Binary Indicator 10.4.3 Model Specification: Adding Interaction Effects 10.5 Understanding Regression Trees and Model Trees 10.5.1 Adding Regression to Trees 10.6 Case Study 2: Baseball Players (Take 2) 10.6.1 Step 2: Exploring and Preparing the Data 10.6.2 Step 3: Training a Model On the Data 10.6.3 Visualizing Decision Trees 10.6.4 Step 4: Evaluating Model Performance 10.6.5 Measuring Performance with Mean Absolute Error 10.6.6 Step 5: Improving Model Performance 10.7 Practice Problem: Heart Attack Data 10.8 Assignments: 10. Forecasting Numeric Data Using Regression Models References
Chapter 11: Black Box Machine-Learning Methods: Neural Networks and Support Vector Machines 11.1 Understanding Neural Networks 11.1.1 From Biological to Artificial Neurons 11.1.2 Activation Functions 11.1.3 Network Topology 11.1.4 The Direction of Information Travel 11.1.5 The Number of Nodes in Each Layer 11.1.6 Training Neural Networks with Backpropagation 11.2 Case Study 1: Google Trends and the Stock Market: Regression 11.2.1 Step 1: Collecting Data Variables 11.2.2 Step 2: Exploring and Preparing the Data 11.2.3 Step 3: Training a Model on the Data 11.2.4 Step 4: Evaluating Model Performance 11.2.5 Step 5: Improving Model Performance 11.2.6 Step 6: Adding Additional Layers 11.3 Simple NN Demo: Learning to Compute 11.4 Case Study 2: Google Trends and the Stock Market - Classification 11.5 Support Vector Machines (SVM) 11.5.1 Classification with Hyperplanes (Finding the Maximum Margin, Linearly Separable Data, Non-linearly Separable Data, Using Kernels for Non-linear Spaces) 11.6 Case Study 3: Optical Character Recognition (OCR) 11.6.1 Step 1: Prepare and Explore the Data 11.6.2 Step 2: Training an SVM Model 11.6.3 Step 3: Evaluating Model Performance 11.6.4 Step 4: Improving Model Performance 11.7 Case Study 4: Iris Flowers 11.7.1 Step 1: Collecting Data 11.7.2 Step 2: Exploring and Preparing the Data 11.7.3 Step 3: Training a Model on the Data 11.7.4 Step 4: Evaluating Model Performance 11.7.5 Step 5: RBF Kernel Function 11.7.6 Parameter Tuning 11.7.7 Improving the Performance of Gaussian Kernels 11.8 Practice 11.8.1 Problem 1: Google Trends and the Stock Market 11.8.2 Problem 2: Quality of Life and Chronic Disease 11.9 Appendix 11.10 Assignments: 11. Black Box Machine-Learning Methods: Neural Networks and Support Vector Machines 11.10.1 Learn and Predict a Power-Function 11.10.2 Pediatric Schizophrenia Study References
Chapter 12: Apriori Association Rules Learning 12.1 Association Rules 12.2 The Apriori Algorithm for Association Rule Learning 12.3 Measuring Rule Importance by Using Support and Confidence 12.4 Building a Set of Rules with the Apriori Principle 12.5 A Toy Example 12.6 Case Study 1: Head and Neck Cancer Medications 12.6.1 Step 1: Collecting Data 12.6.2 Step 2: Exploring and Preparing the Data (Visualizing Item Support: Item Frequency Plots, Visualizing Transaction Data: Plotting the Sparse Matrix) 12.6.3 Step 3: Training a Model on the Data 12.6.4 Step 4: Evaluating Model Performance 12.6.5 Step 5: Improving Model Performance (Sorting the Set of Association Rules, Taking Subsets of Association Rules, Saving Association Rules to a File or Data Frame) 12.7 Practice Problems: Groceries 12.8 Summary 12.9 Assignments: 12. Apriori Association Rules Learning References
Chapter 13: k-Means Clustering 13.1 Clustering as a Machine Learning Task 13.2 Silhouette Plots 13.3 The k-Means Clustering Algorithm 13.3.1 Using Distance to Assign and Update Clusters 13.3.2 Choosing the Appropriate Number of Clusters 13.4 Case Study 1: Divorce and Consequences on Young Adults 13.4.1 Step 1: Collecting Data Variables 13.4.2 Step 2: Exploring and Preparing the Data 13.4.3 Step 3: Training a Model on the Data 13.4.4 Step 4: Evaluating Model Performance 13.4.5 Step 5: Usage of Cluster Information 13.5 Model Improvement 13.5.1 Tuning the Parameter k 13.6 Case Study 2: Pediatric Trauma 13.6.1 Step 1: Collecting Data 13.6.2 Step 2: Exploring and Preparing the Data 13.6.3 Step 3: Training a Model on the Data 13.6.4 Step 4: Evaluating Model Performance 13.6.5 Practice Problem: Youth Development 13.7 Hierarchical Clustering 13.8 Gaussian Mixture Models 13.9 Summary 13.10 Assignments: 13. k-Means Clustering References
Chapter 14: Model Performance Assessment 14.1 Measuring the Performance of Classification Methods 14.2 Evaluation Strategies 14.2.1 Binary Outcomes 14.2.2 Confusion Matrices 14.2.3 Other Measures of Performance Beyond Accuracy 14.2.4 The Kappa (κ) Statistic (Summary of the Kappa Score for Calculating Prediction Accuracy) 14.2.5 Computation of Observed Accuracy and Expected Accuracy 14.2.6 Sensitivity and Specificity 14.2.7 Precision and Recall 14.2.8 The F-Measure 14.3 Visualizing Performance Tradeoffs (ROC Curve) 14.4 Estimating Future Performance (Internal Statistical Validation) 14.4.1 The Holdout Method 14.4.2 Cross-Validation 14.4.3 Bootstrap Sampling 14.5 Assignment: 14. Evaluation of Model Performance References
Chapter 15: Improving Model Performance 15.1 Improving Model Performance by Parameter Tuning 15.2 Using caret for Automated Parameter Tuning 15.2.1 Customizing the Tuning Process 15.2.2 Improving Model Performance with Meta-learning 15.2.3 Bagging 15.2.4 Boosting 15.2.5 Random Forests (Training Random Forests, Evaluating Random Forest Performance) 15.2.6 Adaptive Boosting 15.3 Assignment: 15. Improving Model Performance 15.3.1 Model Improvement Case Study References
Chapter 16: Specialized Machine Learning Topics 16.1 Working with Specialized Data and Databases 16.1.1 Data Format Conversion 16.1.2 Querying Data in SQL Databases 16.1.3 Real Random Number Generation 16.1.4 Downloading the Complete Text of Web Pages 16.1.5 Reading and Writing XML with the XML Package 16.1.6 Web-Page Data Scraping 16.1.7 Parsing JSON from Web APIs 16.1.8 Reading and Writing Microsoft Excel Spreadsheets Using XLSX 16.2 Working with Domain-Specific Data 16.2.1 Working with Bioinformatics Data 16.2.2 Visualizing Network Data 16.3 Data Streaming 16.3.1 Definition 16.3.2 The stream Package 16.3.3 Synthetic Example: Random Gaussian Stream (k-Means Clustering) 16.3.4 Sources of Data Streams (Static Structure Streams, Concept Drift Streams, Real Data Streams) 16.3.5 Printing, Plotting and Saving Streams 16.3.6 Stream Animation 16.3.7 Case-Study: SOCR Knee Pain Data 16.3.8 Data Stream Clustering and Classification (DSC) 16.3.9 Evaluation of Data Stream Clustering 16.4 Optimization and Improving the Computational Performance 16.4.1 Generalizing Tabular Data Structures with dplyr 16.4.2 Making Data Frames Faster with Data.Table 16.4.3 Creating Disk-Based Data Frames with ff 16.4.4 Using Massive Matrices with bigmemory 16.5 Parallel Computing 16.5.1 Measuring Execution Time 16.5.2 Parallel Processing with Multiple Cores 16.5.3 Parallelization Using foreach and doParallel 16.5.4 GPU Computing 16.6 Deploying Optimized Learning Algorithms 16.6.1 Building Bigger Regression Models with biglm 16.6.2 Growing Bigger and Faster Random Forests with bigrf 16.6.3 Training and Evaluation Models in Parallel with caret 16.7 Practice Problem 16.8 Assignment: 16. Specialized Machine Learning Topics 16.8.1 Working with Website Data 16.8.2 Network Data and Visualization 16.8.3 Data Conversion and Parallel Computing References
Chapter 17: Variable/Feature Selection 17.1 Feature Selection Methods 17.1.1 Filtering Techniques 17.1.2 Wrapper Methods 17.1.3 Embedded Techniques 17.2 Case Study: ALS 17.2.1 Step 1: Collecting Data 17.2.2 Step 2: Exploring and Preparing the Data 17.2.3 Step 3: Training a Model on the Data 17.2.4 Step 4: Evaluating Model Performance (Comparing with RFE, Comparing with Stepwise Feature Selection) 17.3 Practice Problem 17.4 Assignment: 17. Variable/Feature Selection 17.4.1 Wrapper Feature Selection 17.4.2 Use the PPMI Dataset References
Chapter 18: Regularized Linear Modeling and Controlled Variable Selection 18.1 Questions 18.2 Matrix Notation 18.3 Regularized Linear Modeling 18.3.1 Ridge Regression 18.3.2 Least Absolute Shrinkage and Selection Operator (LASSO) Regression 18.3.3 Predictor Standardization 18.3.4 Estimation Goals 18.4 Linear Regression 18.4.1 Drawbacks of Linear Regression 18.4.2 Assessing Prediction Accuracy 18.4.3 Estimating the Prediction Error 18.4.4 Improving the Prediction Accuracy 18.4.5 Variable Selection 18.5 Regularization Framework 18.5.1 Role of the Penalty Term 18.5.2 Role of the Regularization Parameter 18.5.3 LASSO 18.5.4 General Regularization Framework 18.6 Implementation of Regularization 18.6.1 Example: Neuroimaging-Genetics Study of Parkinson's Disease Dataset 18.6.2 Computational Complexity 18.6.3 LASSO and Ridge Solution Paths 18.6.4 Choice of the Regularization Parameter 18.6.5 Cross Validation Motivation 18.6.6 n-Fold Cross Validation 18.6.7 LASSO 10-Fold Cross Validation 18.6.8 Stepwise OLS (Ordinary Least Squares) 18.6.9 Final Models 18.6.10 Model Performance 18.6.11 Comparing Selected Features 18.6.12 Summary 18.7 Knock-off Filtering: Simulated Example 18.7.1 Notes 18.8 PD Neuroimaging-Genetics Case-Study 18.8.1 Fetching, Cleaning and Preparing the Data 18.8.2 Preparing the Response Vector 18.8.3 False Discovery Rate (FDR) (Graphical Interpretation of the Benjamini-Hochberg (BH) Method, FDR Adjusting the p-Values) 18.8.4 Running the Knockoff Filter 18.9 Assignment: 18. Regularized Linear Modeling and Knockoff Filtering References
Chapter 19: Big Longitudinal Data Analysis 19.1 Time Series Analysis 19.1.1 Step 1: Plot Time Series 19.1.2 Step 2: Find Proper Parameter Values for ARIMA Model 19.1.3 Check the Differencing Parameter 19.1.4 Identifying the AR and MA Parameters 19.1.5 Step 3: Build an ARIMA Model 19.1.6 Step 4: Forecasting with ARIMA Model 19.2 Structural Equation Modeling (SEM)-Latent Variables 19.2.1 Foundations of SEM 19.2.2 SEM Components 19.2.3 Case Study - Parkinson's Disease (PD) (Step 1 - Collecting Data, Step 2 - Exploring and Preparing the Data, Step 3 - Fitting a Model on the Data) 19.2.4 Outputs of Lavaan SEM 19.3 Longitudinal Data Analysis-Linear Mixed Models 19.3.1 Mean Trend 19.3.2 Modeling the Correlation 19.4 GLMM/GEE Longitudinal Data Analysis 19.4.1 GEE Versus GLMM 19.5 Assignment: 19. Big Longitudinal Data Analysis 19.5.1 Imaging Data 19.5.2 Time Series Analysis 19.5.3 Latent Variables Model References
Chapter 20: Natural Language Processing/Text Mining 20.1 A Simple NLP/TM Example 20.1.1 Define and Load the Unstructured-Text Documents 20.1.2 Create a New VCorpus Object 20.1.3 To-Lower Case Transformation 20.1.4 Text Pre-processing (Remove Stopwords, Remove Punctuation, Stemming: Removal of Plurals and Action Suffixes) 20.1.5 Bags of Words 20.1.6 Document Term Matrix 20.2 Case-Study: Job Ranking 20.2.1 Step 1: Make a VCorpus Object 20.2.2 Step 2: Clean the VCorpus Object 20.2.3 Step 3: Build the Document Term Matrix 20.2.4 Area Under the ROC Curve 20.3 TF-IDF 20.3.1 Term Frequency (TF) 20.3.2 Inverse Document Frequency (IDF) 20.3.3 TF-IDF 20.4 Cosine Similarity 20.5 Sentiment Analysis 20.5.1 Data Preprocessing 20.5.2 NLP/TM Analytics 20.5.3 Prediction Optimization 20.6 Assignment: 20. Natural Language Processing/Text Mining 20.6.1 Mining Twitter Data 20.6.2 Mining Cancer Clinical Notes References
Chapter 21: Prediction and Internal Statistical Cross Validation 21.1 Forecasting Types and Assessment Approaches 21.2 Overfitting 21.2.1 Example (US Presidential Elections) 21.2.2 Example (Google Flu Trends) 21.2.3 Example (Autism) 21.3 Internal Statistical Cross-Validation is an Iterative Process 21.4 Example (Linear Regression) 21.4.1 Cross-Validation Methods 21.4.2 Exhaustive Cross-Validation 21.4.3 Non-Exhaustive Cross-Validation 21.5 Case-Studies 21.5.1 Example 1: Prediction of Parkinson's Disease Using Adaptive Boosting (AdaBoost) 21.5.2 Example 2: Sleep Dataset 21.5.3 Example 3: Model-Based (Linear Regression) Prediction Using the Attitude Dataset 21.5.4 Example 4: Parkinson's Data (ppmi_data) 21.6 Summary of CV output 21.7 Alternative Predictor Functions 21.7.1 Logistic Regression 21.7.2 Quadratic Discriminant Analysis (QDA) 21.7.3 Foundation of LDA and QDA for Prediction, Dimensionality Reduction, and Forecasting (LDA (Linear Discriminant Analysis), QDA (Quadratic Discriminant Analysis)) 21.7.4 Neural Networks 21.7.5 SVM 21.7.6 k-Nearest Neighbors Algorithm (k-NN) 21.7.7 k-Means Clustering (k-MC) 21.7.8 Spectral Clustering (Iris Petal Data, Spirals Data, Income Data) 21.8 Compare the Results 21.9 Assignment: 21. Prediction and Internal Statistical Cross-Validation References
Chapter 22: Function Optimization 22.1 Free (Unconstrained) Optimization 22.1.1 Example 1: Minimizing a Univariate Function (Inverse-CDF) 22.1.2 Example 2: Minimizing a Bivariate Function 22.1.3 Example 3: Using Simulated Annealing to Find the Maximum of an Oscillatory Function 22.2 Constrained Optimization 22.2.1 Equality Constraints 22.2.2 Lagrange Multipliers 22.2.3 Inequality Constrained Optimization (Linear Programming (LP), Mixed Integer Linear Programming (MILP)) 22.2.4 Quadratic Programming (QP) 22.3 General Non-linear Optimization 22.3.1 Dual Problem Optimization (Motivation, Example 1: Linear Example, Example 2: Quadratic Example, Example 3: More Complex Non-linear Optimization, Example 4: Another Linear Example) 22.4 Manual Versus Automated Lagrange Multiplier Optimization 22.5 Data Denoising 22.6 Assignment: 22. Function Optimization 22.6.1 Unconstrained Optimization 22.6.2 Linear Programming (LP) 22.6.3 Mixed Integer Linear Programming (MILP) 22.6.4 Quadratic Programming (QP) 22.6.5 Complex Non-linear Optimization 22.6.6 Data Denoising References
Chapter 23: Deep Learning, Neural Networks 23.1 Deep Learning Training 23.1.1 Perceptrons 23.2 Biological Relevance 23.3 Simple Neural Net Examples 23.3.1 Exclusive OR (XOR) Operator 23.3.2 NAND Operator 23.3.3 Complex Networks Designed Using Simple Building Blocks 23.4 Classification 23.4.1 Sonar Data Example 23.4.2 MXNet Notes 23.5 Case-Studies 23.5.1 ALS Regression Example 23.5.2 Spirals 2D Data 23.5.3 IBS Study 23.5.4 Country QoL Ranking Data 23.5.5 Handwritten Digits Classification (Configuring the Neural Network, Training, Forecasting, Examining the Network Structure Using LeNet) 23.6 Classifying Real-World Images 23.6.1 Load the Pre-trained Model 23.6.2 Load, Preprocess and Classify New Images - US Weather Pattern 23.6.3 Lake Mapourika, New Zealand 23.6.4 Beach Image 23.6.5 Volcano 23.6.6 Brain Surface 23.6.7 Face Mask 23.7 Assignment: 23. Deep Learning, Neural Networks 23.7.1 Deep Learning Classification 23.7.2 Deep Learning Regression 23.7.3 Image Classification References
Summary
Glossary