Edition:
Authors: S. Haines
Series:
ISBN: 9781484274514, 9781484274521
Publisher:
Publication year: 2022
Number of pages: 592
Language: English
File format: PDF (can be converted to EPUB or AZW3 at the user's request)
File size: 5 MB
If you would like the book Modern Data Engineering with Apache Spark: A Hands-On Guide for Building Mission-Critical Streaming Applications converted to PDF, EPUB, AZW3, MOBI, or DJVU format, you can notify support and they will convert the file for you.
Please note that Modern Data Engineering with Apache Spark: A Hands-On Guide for Building Mission-Critical Streaming Applications is the original English-language edition and is not a Persian translation. The International Library website offers original-language books only and does not provide any books translated into or written in Persian.
Table of Contents

About the Author · About the Technical Reviewer · Acknowledgments · Introduction

Chapter 1: Introduction to Modern Data Engineering
The Emergence of Data Engineering · Before the Cloud · Automation as a Catalyst · The Cloud Age · The Public Cloud · The Origins of the Data Engineer · The Many Flavors of Databases · OLTP and the OLAP Database · The Trouble with Transactions · Analytical Queries · No Schema. No Problem. The NoSQL Database · The NewSQL Database · Thinking about Tradeoffs · Cloud Storage · Data Warehouses and the Data Lake · The Data Warehouse · The ETL Job · The Data Lake · The Data Pipeline Architecture · The Data Pipeline · Workflow Orchestration · The Data Catalog · Data Lineage · Stream Processing · Interprocess Communication · Network Queues · From Distributed Queues to Replayable Message Queues · Fault-Tolerance and Reliability · Kafka's Distributed Architecture · Kafka Records · Brokers · Why Stream Processing Matters · Summary

Chapter 2: Getting Started with Apache Spark
The Apache Spark Architecture · The MapReduce Paradigm · Mappers · Durable and Safe Acyclic Execution · Reducers · From Data Isolation to Distributed Datasets · The Spark Programming Model · Did You Never Learn to Share? · The Resilient Distributed Data Model · The Spark Application Architecture · The Role of the Driver Program · The Role of the Cluster Manager · Bring Your Own Cluster · The Role of the Spark Executors · The Modular Spark Ecosystem · The Core Spark Modules · From RDDs to DataFrames and Datasets · Getting Up and Running with Spark · Installing Spark · Downloading Java JDK · Downloading Scala · Downloading Spark · Taking Spark for a Test Ride · The Spark Shell · Exercise 2-1: Revisiting the Business Intelligence Use Case · Defining the Problem · Solving the Problem · Problem 1: Find the Daily Active Users for a Given Day · Problem 2: Calculate the Daily Average Number of Items Across All User Carts · Problem 3: Generate the Top Ten Most Added Items Across All User Carts · Exercise 2-1: Summary · Summary

Chapter 3: Working with Data
Docker Containers · Docker Desktop · Configuring Docker · Apache Zeppelin · Interpreters · Notebooks · Preparing Your Zeppelin Environment · Running Apache Zeppelin with Docker · Docker Network · Docker Compose · Volumes · Environment · Ports · Using Apache Zeppelin · Binding Interpreters · Exercise 3-1: Reading Plain Text Files and Transforming DataFrames · Converting Plain Text Files into DataFrames · Peeking at the Contents of a DataFrame · DataFrame Transformation with Pattern Matching · Exercise 3-1: Summary · Working with Structured Data · Exercise 3-2: DataFrames and Semi-Structured Data · Schema Inference · Using Inferred Schemas · Using Declared Schemas · Steal the Schema Pattern · Building a Data Definition · All About the StructType · StructField · Spark Data Types · Adding Metadata to Your Structured Schemas · Exercise 3-2: Summary · Using Interpreted Spark SQL · Exercise 3-3: A Quick Introduction to SparkSQL · Creating SQL Views · Using the Spark SQL Zeppelin Interpreter · Computing Averages · Exercise 3-3: Summary · Your First Spark ETL · Exercise 3-4: An End-to-End Spark ETL · Writing Structured Data · Parquet Data · Reading Parquet Data · Exercise 3-4: Summary · Summary

Chapter 4: Transforming Data with Spark SQL and the DataFrame API
Data Transformations · Basic Data Transformations · Exercise 4-1: Selections and Projections · Data Generation · Selection · Filtering · Projection · Exercise 4-1: Summary · Joins · Exercise 4-2: Expanding Data Through Joins · Inner Join · Right Join · Left Join · Semi-Join · Anti-Join · Semi-Join and Anti-Join Aliases · Using the IN Operator · Negating the IN Operator · Full Join · Exercise 4-2: Summary · Putting It All Together · Exercise 4-3: Problem Solving with SQL Expressions and Conditional Queries · Expressions as Columns · Using an Inner Query · Using Conditional Select Expressions · Exercise 4-3: Summary · Summary

Chapter 5: Bridging Spark SQL with JDBC
Overview · MySQL on Docker Crash Course · Starting Up the Docker Environment · Docker MySQL Config · Exercise 5-1: Exploring MySQL 8 on Docker · Working with Tables · Connecting to the MySQL Docker Container · Using the MySQL Shell · The Default Database · Creating the Customers Table · Inserting Customer Records · Viewing the Customers Table · Exercise 5-1: Summary · Using RDBMS with Spark SQL and JDBC · Managing Dependencies · Exercise 5-2: Config-Driven Development with the Spark Shell and JDBC · Configuration, Dependency Management, and Runtime File Interpretation in the Spark Shell · Runtime Configuration · Local Dependency Management · Runtime Package Management · Dynamic Class Compilation and Loading · Spark Config: Access Patterns and Runtime Mutation · Viewing the SparkConf · Accessing the Runtime Configuration · Iterative Development with the Spark Shell · Describing Views and Tables · Writing DataFrames to External MySQL Tables · Generate Some New Customers · Using JDBC DataFrameWriter · SaveMode · Exercise 5-2: Summary · Continued Explorations · Good Schemas Lead to Better Designs · Write Customer Records with Minimal Schema · Deduplicate, Reorder, and Truncate Your Table · Drop Duplicates · Sorting with Order By · Truncating SQL Tables · Stash and Replace · Summary

Chapter 6: Data Discovery and the Spark SQL Catalog
Data Discovery and Data Catalogs · Why Data Catalogs Matter · Data Wishful Thinking · Data Catalogs to the Rescue · The Apache Hive Metastore · Metadata with a Modern Twist · Exercise 6-1: Enhancing Spark SQL with the Hive Metastore · Configuring the Hive Metastore · Create the Metastore Database · Connect to the MySQL Docker Container · Authenticate as the root MySQL User · Create the Hive Metastore Database · Grant Access to the Metastore · Create the Metastore Tables · Authenticate as the dataeng User · Switch Databases to the Metastore · Import the Hive Metastore Tables · Configuring Spark to Use the Hive Metastore · Configure the Hive Site XML · Configure Apache Spark to Connect to Your External Hive Metastore · Using the Hive Metastore for Schema Enforcement · Production Hive Metastore Considerations · Exercise 6-1: Summary · The Spark SQL Catalog · Exercise 6-2: Using the Spark SQL Catalog · Creating the Spark Session · Spark SQL Databases · Listing Available Databases · Finding the Current Database · Creating a Database · Loading External Tables Using JDBC · Listing Tables · Creating Persistent Tables · Finding the Existence of a Table · Databases and Tables in the Hive Metastore · View Hive Metastore Databases · View Hive Metastore Tables · Hive Table Parameters · Working with Tables from the Spark SQL Catalog · Data Discovery Through Table and Column-Level Annotations · Adding Table-Level Descriptions and Listing Tables · Adding Column Descriptions and Listing Columns · Caching Tables · Cache a Table in Spark Memory · The Storage View of the Spark UI · Force Spark to Cache · Uncache Tables · Clear All Table Caches · Refresh a Table · Testing Automatic Cache Refresh with Spark Managed Tables · Removing Tables · Drop Table · Conditionally Drop a Table · Using Spark SQL Catalyst to Remove a Table · Exercise 6-2: Summary · The Spark Catalyst Optimizer · Introspecting Spark's Catalyst Optimizer with Explain · Logical Plan Parsing · Logical Plan Analysis · Unresolvable Errors · Logical Plan Optimization · Physical Planning · Java Bytecode Generation · Datasets · Exercise 6-3: Converting DataFrames to Datasets · Create the Customers Case Class · Dataset Aliasing · Mixing Catalyst and Scala Functionality · Using Typed Catalyst Expressions · Exercise 6-3: Summary · Summary

Chapter 7: Data Pipelines and Structured Spark Applications
Data Pipelines · Pipeline Foundations · Spark Applications: Form and Function · Interactive Applications · Spark Shell · Notebook Environments · Batch Applications · Stateless Batch Applications · Stateful Batch Applications · From Stateful Batch to Streaming Applications · Streaming Applications · Micro-Batch Processing · Continuous Processing · Designing Spark Applications · Use Case: CoffeeCo and the Ritual of Coffee · Thinking about Data · Data Storytelling and Modeling Data · Exercise 7-1: Data Modeling · The Story · Breaking Down the Story · Extracting the Data Models · Customer · Store · Product, Goods and Items · Vendor · Location · Rating · Exercise 7-1: Summary · From Data Model to Data Application · Every Application Begins with an Idea · The Idea · Exercise 7-2: Spark Application Blueprint · Default Application Layout · README.md · build.sbt · conf · project · src · Common Spark Application Components · Application Configuration · Application Default Config · Runtime Config Overrides · Common Spark Application Initialization · Dependable Batch Applications · Exercise 7-2: Summary · Connecting the Dots · Application Goals · Exercise 7-3: The SparkEventExtractor Application · The Rating Event · CustomerRatingEventType · Designing the Runtime Configuration · Planning to Launch · Application Recap · Assembling the SparkEventExtractor · SparkEventExtractorApp · Validate the Spark Configuration · Write the Batch Job · Spark Delegation Pattern · SparkEventExtractor · Compiling the Spark Application · Exercise 7-3: Summary · Testing Apache Spark Applications · Adding Test Dependencies · Exercise 7-4: Writing Your First Spark Test Suite · Configure Spark in the SparkEventExtractorSpec · Testing for Success · Testing for Failures · Fully Testing Your End-to-End Spark Application · Exercise 7-4: Summary · Summary

Chapter 8: Workflow Orchestration with Apache Airflow
Workflow Orchestration · Apache Airflow · When Orchestration Matters · Working Together · Exercise 8-1: Getting Airflow Up and Running · Installing Airflow · Add the Directories · Add Environment Variables for Docker Compose · Initialize Airflow · Running Airflow · Running Airflow in Detached Mode · Start Things Up · Tear Things Down · Optimizing Your Local Data Engineering Environment · Sanity Check: Is Airflow Running? · Running an Airflow DAG · Exercise 8-1: Summary · The Core Components of Apache Airflow · Tasks · Operators · Schedulers and Executors · Local Execution · Remote Execution · Scheduling Spark Batch Jobs with Airflow · Exercise 8-2: Installing the Spark Airflow Provider and Running a Spark DAG · Locating the Airflow Containers · Manually Installing the Spark Provider · Using Containers for Runtime Consistency · Running Spark Jobs with Apache Airflow · Add Airflow Variables · Add Airflow Connections · Writing Your First Spark DAG · Running the Spark DAG · Exercise 8-2: Summary · Running the SparkEventExtractorApp Using Airflow · Starting with a Working Spark Submit · Exercise 8-3: Writing and Running the Customer Ratings Airflow DAG · The Spark Configuration · Local Job Configuration · Spark Submit Properties · Copy the Spark Application JAR into the Shared Spark JARs Location · Running the Customer Ratings ETL DAG · Exercise 8-3: Summary · Looking at the Hybrid Architecture · Setting Up the Extended Environment · Migrating to MinIO from the Local File System · Apply the Bucket Prefix Permissions · Bootstrapping MySQL in a Common Location · Continued Explorations · Creating a User · Summary

Chapter 9: A Gentle Introduction to Stream Processing
Stream Processing · Use Case: Real-Time Parking Availability · Time Series Data and Event Streams · Do Events Stand Alone? · Use Case: Tracking Customer Satisfaction · The Event Time, Order of Events Captured, and the Delay Between Events All Tell a Story · The Trouble with Time · Priority Ordered Event Processing Patterns · Real-Time Processing · Near Real-Time Processing · Batch Processing · On-Demand or Just-In-Time Processing · Foundations of Stream Processing · Building Reliable Streaming Data Systems · Managing Streaming Data Accountability · Dealing with Data Problems in Flight · Data Gatekeepers · Software Development Kits · Selecting the Right Data Protocol for the Job · Serializable Structured Data · Avro Message Format · Avro Binary Format · Enable Backward Compatibility and Prevent Data Corruption · Best Practices for Streaming Avro Data · Protobuf Message Format · Code Generation · Protobuf Binary Format · Enable Backward Compatibility and Prevent Data Corruption · Best Practices for Streaming Protobuf Data · Remote Procedure Calls · gRPC · Define a gRPC Service · gRPC Speaks HTTP/2 · Summary

Chapter 10: Patterns for Writing Structured Streaming Applications
What Is Apache Spark Structured Streaming? · Unbounded Tables and the Spark Structured Streaming Programming Model · Processing Modes · Micro-Batch Processing · Continuous Processing · Exercise Overview · The Challenge · The OrderEvent Format · Leaning on the Spark Struct · Exercise 10-1: Using Redis Streams to Drive Spark Structured Streaming · Spinning Up the Local Environment · Connecting to Redis · Connect to the Redis CLI · Enable the Redis Command Monitor · Creating the Redis Stream · Stream Guarantees · Redis Streams Are Append-Only · Event Publishing to a Redis Stream · Consume Events from a Redis Stream · The SparkRedisStreamsApp · Compile the Application · Build the Docker Container · Running the Application · Redis Streams Consumer Groups and Offsets · Consumer Groups · Creating the Next Batch · Exercise 10-1: Summary · Exercise 10-2: Breaking Down the Redis Streams Application · DataStreamReader · Is the Schema Required? · DataStreamWriter · Lazy Invocation · Depending on Dependency Injection · Writing Streaming Data · Streaming Query · Application Entry Point · Exercise 10-2: Summary · Exercise 10-3: Reliable Stateful Stream Processing with Checkpoints · Why Checkpoints Matter · Reliable Checkpoints · Enabling Reliable Checkpoints · Running the Stateful Application · Observing the Stateful Behavior · How Checkpoints Work · Deleting Checkpoints · Exercise 10-3: Summary · Exercise 10-4: Using Triggers to Control Application Runtime Behavior · The Updated Run Method with Triggers · Continuous Mode Processing · Periodic Processing · Stateful Batch Jobs · Streaming Table Writer · Running the Stateful Application with Triggers · Run with Trigger Once · Run with ProcessingTime · Running Continuously · Viewing the Results · Exercise 10-4: Summary · Summary

Chapter 11: Apache Kafka and Spark Structured Streaming
Apache Kafka in a Nutshell · Asynchronous Communication · Horizontal Scalability · High Service Availability and Consistency · Disaster Recovery · Event Streams and Data Pipelines · Ecosystem · Chapter Exercises · Exercise 11-1: Getting Up and Running with Apache Kafka · Exercise 11-1: Materials · Spinning Up Your Local Environment · Creating Your First Kafka Topic · How Topics Behave · Creating a Kafka Topic · Kafka Topic Management · Listing Kafka Topics · Describing a Topic · Modifying a Topic · Altering Topic Configurations · Increasing Topic Partition Replication · Truncating Topics · Reducing the Topic Retention Period · Increasing the Topic Retention Period · Deleting Kafka Topics · Securing Access to Topics · Firewall Rules · Access Control Lists · End-to-End Encryption with Mutual TLS · Exercise 11-1: Summary · Exercise 11-2: Generating Binary Serializable Event Data with Spark and Publishing to Kafka · Exercise 11-2: Materials · CoffeeOrder Event Format · Compiling Protobuf Messages · Protobuf Message Interoperability with Spark and Kafka · Adding the Protobuf and Kafka Spark Dependencies · Escaping JAR Hell · Writing the CoffeeOrderGenerator · Running the Generator · Exercise 11-2: Summary · Exercise 11-3: Consuming Streams of Serializable Structured Data with Spark · Exercise 11-3: Materials · Consuming Binary Data from Apache Kafka · Topic Subscription · Topic Assignment · Throttling Kafka · Handling Failures · Splitting the Partitions · From Kafka Rows to Datasets · Running the Consumer Application · Exercise 11-3: Summary · Summary

Chapter 12: Analytical Processing and Insights
Exercises and Materials · Setting Up the Environment · Spinning Up the Local Environment · Using Common Spark Functions for Analytical Preprocessing · Exercise 12-1: Preprocessing Datasets for Analytics · Working with Timestamps and Dates · Common Date and Timestamp Functions · Applying Higher-Order Functions Using withColumn · Using Date Addition and Subtraction · Calendar Functions · Time Zones and the Spark SQL Session · Configuring the Time Zone · Modifying the Spark Time Zone at Runtime Using Set Time Zone · Seasonality, Time Zones, and Insights · Timestamps and Dates Summary · Preparing Data for Analysis · Replacing Null Values on a DataFrame · Labeling Data Using Case Statements · Case Statements on the Dataset · Case Statements on a Spark SQL Table · The Case for Case Statements · User-Defined Functions in Spark · Using Scala Functions in UDFs · Using Function Literals in UDFs · Using Inline Functions in UDFs · How User-Defined Functions Work · Using UDFs with the Spark DSL · Registering UDFs for Spark SQL · Introspecting UDF Functions · UDF Visibility and Availability · Creating and Registering Permanent UDFs in the Spark SQL Catalog · Using UDFs with Spark SQL · Regarding User-Defined Functions · Exercise 12-1: Summary · Analytical Processing and Insights · Engineering Data Aggregation · Exercise 12-2: Grouped Data, Aggregations, Analytics, and Insights · Relational Grouped Datasets · Columnar Aggregations with Grouping · Aggregating Using the Spark DSL · Computing Summary Statistics for Insights · Using Describe to Compute Simple Summary Statistics · Using Agg to Compute Complex Summary Statistics · Using Rollups for Hierarchical Aggregations · Using Pivots · Creating the Order Items · Using Array Explode to Create Many Rows from One · Using Array Explode and Join to Calculate the Order Total · Using Pivots to Calculate Price Aggregates Across Menu Item Categories on a Per Order Basis · Analytical Window Functions · Calculating the Cumulative Average Items Purchased · Difference Between Transactions · Exercise 12-2: Summary · Summary

Chapter 13: Advanced Analytics with Spark Stateful Structured Streaming
Exercises and Materials · Stateful Aggregations with Structured Streaming · Creating Windowed Aggregations Over Time · Window Functions vs. Windowing · Windowing and Controlling Output · Streaming Output Modes · Complete · Update · Append · Append Output Mode · Watermarks for Streaming Data · Chapter Exercises Overview · Input Data Format (CoffeeOrder) · Output Windowed Store Revenue Aggregates · Exercise 13-1: Store Revenue Aggregations · Structured Streaming Application Trait · Stream Reader · Output Stream Decorator · Conditional Object Decoration · Spark Stateful Aggregations App · Streaming Aggregations · The Transform Method · The Process Method · Exercise 13-1: Summary · Typed Arbitrary Stateful Computations · KeyValueGroupedDataset · Iterative Computation with mapGroupsWithState · MapGroupsWithState Function · FlatMapGroupsWithState Function · Exercise 13-2: Arbitrary Stateful Computations on Typed Datasets · SparkTypedStatefulAggregationsApp · TypedRevenueAggregates · CoffeeOrderForAnalysis · TypedStoreRevenueAggregates State Function · Average of Averages · Exercise 13-2: Summary · Exercise 13-3: Testing Structured Streaming Applications · MemoryStream · Exercise 13-3: Summary · Summary

Chapter 14: Deploying Mission-Critical Spark Applications on Spark Standalone
Deployment Patterns · Running Spark Applications · Deploying Spark Applications · Spark Cluster Modes and Resource Managers · Spark Standalone Mode · Spark Standalone: High Availability Mode · The Failover Process · Deploy Modes: Client vs. Cluster Mode · Client Mode · Cluster Mode · Distributed Shared Resources and Resource Scheduling for Multi-Tenancy · Controlling Resource Allocations, Application Behavior, and Scheduling · Elastically Scaling Spark Applications with Dynamic Resource Allocation · Spark Listeners and Application Monitoring · Spark Listener · Observing Structured Streaming Behavior with the StreamingQueryListener · Monitoring Patterns · Spark Standalone Cluster and Application Migration Strategies · Anti-Pattern: Hope for the Best · Best Practices Regarding Containers and Spark Standalone · Managed Spark · Summary

Chapter 15: Deploying Mission-Critical Spark Applications on Kubernetes
Kubernetes 101 · Part 1: Getting Up and Running on Kubernetes · Using Minikube to Power Local Kubernetes · A Hands-On Tour of the Kubernetes APIs · Nodes · Common Node Services and Responsibilities · kubelet · kube-proxy · Container Runtime · Viewing All Cluster Nodes · Namespaces · Creating a Namespace · Deleting a Namespace · Creating Kubernetes Resources with Apply · Pods, Containers, and Deployments · What Is a Pod? · Creating a Redis Pod Spec · metadata · spec · Scheduling the Redis Pod · Listing, Describing, and Locating Pods · Executing Commands on the Redis Pod · Deleting the Redis Pod · Deployments and Handling Automatic Recovery on Failure · Creating a Kubernetes Deployment · Deployments and Resource Management · Locate the Newly Deployed Pod · Connect to the Managed Pod · Observe Pod States and Behavior · Delete the Managed Pod · Persisting Data Between Deploys · Persistent Volumes (PV) · Configuring a Persistent Volume (PV) · Create the Persistent Volume · View the Persistent Volume · Describe the Details of the Persistent Volume · Persistent Volume Claims (PVC) · Configuring a Persistent Volume Claim (PVC) · Create the Persistent Volume Claim (PVC) · Use Get to View the PVC · View the PV · Enhancing the Redis Deployment with Data Persistence · Kubernetes Services and DNS · Creating a Service · Viewing Services (SVC) · Part 1: Summary · Part 2: Deploying Spark Structured Streaming Applications on Kubernetes · Spark Namespace, Application Environment, and Role-Based Access Controls · Define the Spark Apps Namespace · Role-Based Access Control (RBAC) · Roles · Service Accounts · Role Bindings · Creating the Spark Controller Service Account, Role, and Role-Binding · Redis Streams K8s Application · Building the Redis Streams Application · Building the Application Container · Deploying the Redis Streams Application on Kubernetes · Externalized Dependency Management · External Configuration with ConfigMaps · Secrets · Securely Storing the Hive Metastore Configuration · Securely Storing the MinIO Access Parameters · Viewing Secrets in the Namespace · Cross Namespace DNS and Network Policies · Controlling Deployments with Jump Pods · Triggering Spark Submit from the Jump Pod · Connecting the Deployment Dots · Viewing the Driver Logs · Viewing the Driver UI · Sending CoffeeOrder Events · Deleting the Driver and Jump Pods · Automating the Deployment and Automatic Recovery · Part 2: Summary

Conclusion · After Thoughts · Index