دسترسی نامحدود
برای کاربرانی که ثبت نام کرده اند
برای ارتباط با ما می توانید از طریق شماره موبایل زیر از طریق تماس و پیامک با ما در ارتباط باشید
در صورت عدم پاسخ گویی از طریق پیامک با پشتیبان در ارتباط باشید
برای کاربرانی که ثبت نام کرده اند
درصورت عدم همخوانی توضیحات با کتاب
از ساعت 7 صبح تا 10 شب
دسته بندی: پایگاه داده ها ویرایش: 1 نویسندگان: Gerard Maas. Francois Garillot سری: ISBN (شابک) : 1491944242, 9781491944240 ناشر: O’Reilly Media سال نشر: 2019 تعداد صفحات: 0 زبان: English فرمت فایل : EPUB (درصورت درخواست کاربر به PDF، EPUB یا AZW3 تبدیل می شود) حجم فایل: 4 مگابایت
کلمات کلیدی مربوط به کتاب پردازش جریان با Apache Spark: تسلط بر جریان ساختاری و جریان جرقه ای: رایانش ابری، یادگیری ماشین، تجزیه و تحلیل، اسپارک آپاچی، طوفان آپاچی، نظارت، پردازش جریان، آپاچی کافکا، پردازش دستهای، خوشهها، Apache Flink، Spark SQL، مجموعه دادههای توزیعشده انعطافپذیر، تنظیم عملکرد، جریان اسپارک، پردازش معماری Lambda، معماری، جریان ساخت یافته
در صورت تبدیل فایل کتاب Stream Processing with Apache Spark: Mastering Structured Streaming and Spark Streaming به فرمت های PDF، EPUB، AZW3، MOBI و یا DJVU می توانید به پشتیبان اطلاع دهید تا فایل مورد نظر را تبدیل نمایند.
توجه داشته باشید کتاب پردازش جریان با Apache Spark: تسلط بر جریان ساختاری و جریان جرقه ای نسخه زبان اصلی می باشد و کتاب ترجمه شده به فارسی نمی باشد. وبسایت اینترنشنال لایبرری ارائه دهنده کتاب های زبان اصلی می باشد و هیچ گونه کتاب ترجمه شده یا نوشته شده به فارسی را ارائه نمی دهد.
قبل از اینکه بتوانید ابزارهای تحلیلی برای به دست آوردن بینش سریع بسازید، ابتدا باید بدانید که چگونه داده ها را در زمان واقعی پردازش کنید. با این راهنمای عملی، توسعه دهندگان آشنا با Apache Spark یاد خواهند گرفت که چگونه این چارچوب درون حافظه را برای استفاده در جریان داده ها قرار دهند. متوجه خواهید شد که چگونه Spark شما را قادر میسازد تا کارهای پخش جریانی را تقریباً به همان روشی که کارهای دستهای مینویسید بنویسید. نویسندگان جرارد ماس و فرانسوا گاریلو به شما کمک می کنند تا زیربنای نظری آپاچی اسپارک را کشف کنید. این راهنمای جامع دارای دو بخش است که APIهای جریانی را که Spark اکنون پشتیبانی میکند مقایسه و مقایسه میکند: کتابخانه اصلی Spark Streaming و API جدیدتر Structured Streaming. • مفاهیم اساسی پردازش جریان را بیاموزید و معماری های مختلف جریان را بررسی کنید • جریان ساختاریافته را از طریق مثال های عملی کاوش کنید. جنبه های مختلف پردازش جریانی را با جزئیات بیاموزید • با Spark Streaming، کارها و برنامه های پخش جریانی را ایجاد و اجرا کنید. Spark Streaming را با سایر APIهای Spark یکپارچه کنید • تکنیک های پیشرفته جریان جرقه، از جمله الگوریتم های تقریبی و الگوریتم های یادگیری ماشینی را بیاموزید • Apache Spark را با سایر پروژههای پردازش جریان، از جمله Apache Storm، Apache Flink، و Apache Kafka Streams مقایسه کنید.
Before you can build analytics tools to gain quick insights, you first need to know how to process data in real time. With this practical guide, developers familiar with Apache Spark will learn how to put this in-memory framework to use for streaming data. You’ll discover how Spark enables you to write streaming jobs in almost the same way you write batch jobs. Authors Gerard Maas and François Garillot help you explore the theoretical underpinnings of Apache Spark. This comprehensive guide features two sections that compare and contrast the streaming APIs Spark now supports: the original Spark Streaming library and the newer Structured Streaming API. • Learn fundamental stream processing concepts and examine different streaming architectures • Explore Structured Streaming through practical examples; learn different aspects of stream processing in detail • Create and operate streaming jobs and applications with Spark Streaming; integrate Spark Streaming with other Spark APIs • Learn advanced Spark Streaming techniques, including approximation algorithms and machine learning algorithms • Compare Apache Spark to other stream processing projects, including Apache Storm, Apache Flink, and Apache Kafka Streams
Copyright Table of Contents Foreword Preface Who Should Read This Book? Installing Spark Learning Scala The Way Ahead Bibliography Conventions Used in This Book Using Code Examples O’Reilly Online Learning How to Contact Us Acknowledgments From Gerard From François Part I. Fundamentals of Stream Processing with Apache Spark Chapter 1. Introducing Stream Processing What Is Stream Processing? Batch Versus Stream Processing The Notion of Time in Stream Processing The Factor of Uncertainty Some Examples of Stream Processing Scaling Up Data Processing MapReduce The Lesson Learned: Scalability and Fault Tolerance Distributed Stream Processing Stateful Stream Processing in a Distributed System Introducing Apache Spark The First Wave: Functional APIs The Second Wave: SQL A Unified Engine Spark Components Spark Streaming Structured Streaming Where Next? Chapter 2. Stream-Processing Model Sources and Sinks Immutable Streams Defined from One Another Transformations and Aggregations Window Aggregations Tumbling Windows Sliding Windows Stateless and Stateful Processing Stateful Streams An Example: Local Stateful Computation in Scala A Stateless Definition of the Fibonacci Sequence as a Stream Transformation Stateless or Stateful Streaming The Effect of Time Computing on Timestamped Events Timestamps as the Provider of the Notion of Time Event Time Versus Processing Time Computing with a Watermark Summary Chapter 3. Streaming Architectures Components of a Data Platform Architectural Models The Use of a Batch-Processing Component in a Streaming Application Referential Streaming Architectures The Lambda Architecture The Kappa Architecture Streaming Versus Batch Algorithms Streaming Algorithms Are Sometimes Completely Different in Nature Streaming Algorithms Can’t Be Guaranteed to Measure Well Against Batch Algorithms Summary Chapter 4. Apache Spark as a Stream-Processing Engine The Tale of Two APIs Spark’s Memory Usage Failure Recovery Lazy Evaluation Cache Hints Understanding Latency Throughput-Oriented Processing Spark’s Polyglot API Fast Implementation of Data Analysis To Learn More About Spark Summary Chapter 5. Spark’s Distributed Processing Model Running Apache Spark with a Cluster Manager Examples of Cluster Managers Spark’s Own Cluster Manager Understanding Resilience and Fault Tolerance in a Distributed System Fault Recovery Cluster Manager Support for Fault Tolerance Data Delivery Semantics Microbatching and One-Element-at-a-Time Microbatching: An Application of Bulk-Synchronous Processing One-Record-at-a-Time Processing Microbatching Versus One-at-a-Time: The Trade-Offs Bringing Microbatch and One-Record-at-a-Time Closer Together Dynamic Batch Interval Structured Streaming Processing Model The Disappearance of the Batch Interval Chapter 6. Spark’s Resilience Model Resilient Distributed Datasets in Spark Spark Components Spark’s Fault-Tolerance Guarantees Task Failure Recovery Stage Failure Recovery Driver Failure Recovery Summary Appendix A. References for Part I Part II. Structured Streaming Chapter 7. Introducing Structured Streaming First Steps with Structured Streaming Batch Analytics Streaming Analytics Connecting to a Stream Preparing the Data in the Stream Operations on Streaming Dataset Creating a Query Start the Stream Processing Exploring the Data Summary Chapter 8. The Structured Streaming Programming Model Initializing Spark Sources: Acquiring Streaming Data Available Sources Transforming Streaming Data Streaming API Restrictions on the DataFrame API Sinks: Output the Resulting Data format outputMode queryName option options trigger start() Summary Chapter 9. Structured Streaming in Action Consuming a Streaming Source Application Logic Writing to a Streaming Sink Summary Chapter 10. Structured Streaming Sources Understanding Sources Reliable Sources Must Be Replayable Sources Must Provide a Schema Available Sources The File Source Specifying a File Format Common Options Common Text Parsing Options (CSV, JSON) JSON File Source Format CSV File Source Format Parquet File Source Format Text File Source Format The Kafka Source Setting Up a Kafka Source Selecting a Topic Subscription Method Configuring Kafka Source Options Kafka Consumer Options The Socket Source Configuration Operations The Rate Source Options Chapter 11. Structured Streaming Sinks Understanding Sinks Available Sinks Reliable Sinks Sinks for Experimentation The Sink API Exploring Sinks in Detail The File Sink Using Triggers with the File Sink Common Configuration Options Across All Supported File Formats Common Time and Date Formatting (CSV, JSON) The CSV Format of the File Sink The JSON File Sink Format The Parquet File Sink Format The Text File Sink Format The Kafka Sink Understanding the Kafka Publish Model Using the Kafka Sink The Memory Sink Output Modes The Console Sink Options Output Modes The Foreach Sink The ForeachWriter Interface TCP Writer Sink: A Practical ForeachWriter Example The Moral of this Example Troubleshooting ForeachWriter Serialization Issues Chapter 12. Event Time–Based Stream Processing Understanding Event Time in Structured Streaming Using Event Time Processing Time Watermarks Time-Based Window Aggregations Defining Time-Based Windows Understanding How Intervals Are Computed Using Composite Aggregation Keys Tumbling and Sliding Windows Record Deduplication Summary Chapter 13. Advanced Stateful Operations Example: Car Fleet Management Understanding Group with State Operations Internal State Flow Using MapGroupsWithState Using FlatMapGroupsWithState Output Modes Managing State Over Time Summary Chapter 14. Monitoring Structured Streaming Applications The Spark Metrics Subsystem Structured Streaming Metrics The StreamingQuery Instance Getting Metrics with StreamingQueryProgress The StreamingQueryListener Interface Implementing a StreamingQueryListener Chapter 15. Experimental Areas: Continuous Processing and Machine Learning Continuous Processing Understanding Continuous Processing Using Continuous Processing Limitations Machine Learning Learning Versus Exploiting Applying a Machine Learning Model to a Stream Example: Estimating Room Occupancy by Using Ambient Sensors Online Training Appendix B. References for Part II Part III. Spark Streaming Chapter 16. Introducing Spark Streaming The DStream Abstraction DStreams as a Programming Model DStreams as an Execution Model The Structure of a Spark Streaming Application Creating the Spark Streaming Context Defining a DStream Defining Output Operations Starting the Spark Streaming Context Stopping the Streaming Process Summary Chapter 17. The Spark Streaming Programming Model RDDs as the Underlying Abstraction for DStreams Understanding DStream Transformations Element-Centric DStream Transformations RDD-Centric DStream Transformations Counting Structure-Changing Transformations Summary Chapter 18. The Spark Streaming Execution Model The Bulk-Synchronous Architecture The Receiver Model The Receiver API How Receivers Work The Receiver’s Data Flow The Internal Data Resilience Receiver Parallelism Balancing Resources: Receivers Versus Processing Cores Achieving Zero Data Loss with the Write-Ahead Log The Receiverless or Direct Model Summary Chapter 19. Spark Streaming Sources Types of Sources Basic Sources Receiver-Based Sources Direct Sources Commonly Used Sources The File Source How It Works The Queue Source How It Works Using a Queue Source for Unit Testing A Simpler Alternative to the Queue Source: The ConstantInputDStream The Socket Source How It Works The Kafka Source Using the Kafka Source How It Works Where to Find More Sources Chapter 20. Spark Streaming Sinks Output Operations Built-In Output Operations print saveAsxyz foreachRDD Using foreachRDD as a Programmable Sink Third-Party Output Operations Chapter 21. Time-Based Stream Processing Window Aggregations Tumbling Windows Window Length Versus Batch Interval Sliding Windows Sliding Windows Versus Batch Interval Sliding Windows Versus Tumbling Windows Using Windows Versus Longer Batch Intervals Window Reductions reduceByWindow reduceByKeyAndWindow countByWindow countByValueAndWindow Invertible Window Aggregations Slicing Streams Summary Chapter 22. Arbitrary Stateful Streaming Computation Statefulness at the Scale of a Stream updateStateByKey Limitation of updateStateByKey Performance Memory Usage Introducing Stateful Computation with mapwithState Using mapWithState Event-Time Stream Computation Using mapWithState Chapter 23. Working with Spark SQL Spark SQL Accessing Spark SQL Functions from Spark Streaming Example: Writing Streaming Data to Parquet Dealing with Data at Rest Using Join to Enrich the Input Stream Join Optimizations Updating Reference Datasets in a Streaming Application Enhancing Our Example with a Reference Dataset Summary Chapter 24. Checkpointing Understanding the Use of Checkpoints Checkpointing DStreams Recovery from a Checkpoint Limitations The Cost of Checkpointing Checkpoint Tuning Chapter 25. Monitoring Spark Streaming The Streaming UI Understanding Job Performance Using the Streaming UI Input Rate Chart Scheduling Delay Chart Processing Time Chart Total Delay Chart Batch Details The Monitoring REST API Using the Monitoring REST API Information Exposed by the Monitoring REST API The Metrics Subsystem The Internal Event Bus Interacting with the Event Bus Summary Chapter 26. Performance Tuning The Performance Balance of Spark Streaming The Relationship Between Batch Interval and Processing Delay The Last Moments of a Failing Job Going Deeper: Scheduling Delay and Processing Delay Checkpoint Influence in Processing Time External Factors that Influence the Job’s Performance How to Improve Performance? Tweaking the Batch Interval Limiting the Data Ingress with Fixed-Rate Throttling Backpressure Dynamic Throttling Tuning the Backpressure PID Custom Rate Estimator A Note on Alternative Dynamic Handling Strategies Caching Speculative Execution Appendix C. References for Part III Part IV. Advanced Spark Streaming Techniques Chapter 27. Streaming Approximation and Sampling Algorithms Exactness, Real Time, and Big Data Exactness Real-Time Processing Big Data The Exactness, Real-Time, and Big Data triangle Big Data and Real Time Approximation Algorithms Hashing and Sketching: An Introduction Counting Distinct Elements: HyperLogLog Role-Playing Exercise: If We Were a System Administrator Practical HyperLogLog in Spark Counting Element Frequency: Count Min Sketches Introducing Bloom Filters Bloom Filters with Spark Computing Frequencies with a Count-Min Sketch Ranks and Quantiles: T-Digest T-Digest in Spark Reducing the Number of Elements: Sampling Random Sampling Stratified Sampling Chapter 28. Real-Time Machine Learning Streaming Classification with Naive Bayes streamDM Introduction Naive Bayes in Practice Training a Movie Review Classifier Introducing Decision Trees Hoeffding Trees Hoeffding Trees in Spark, in Practice Streaming Clustering with Online K-Means K-Means Clustering Online Data and K-Means The Problem of Decaying Clusters Streaming K-Means with Spark Streaming Appendix D. References for Part IV Part V. Beyond Apache Spark Chapter 29. Other Distributed Real-Time Stream Processing Systems Apache Storm Processing Model The Storm Topology The Storm Cluster Compared to Spark Apache Flink A Streaming-First Framework Compared to Spark Kafka Streams Kafka Streams Programming Model Compared to Spark In the Cloud Amazon Kinesis on AWS Microsoft Azure Stream Analytics Apache Beam/Google Cloud Dataflow Chapter 30. Looking Ahead Stay Plugged In Seek Help on Stack Overflow Start Discussions on the Mailing Lists Attend Conferences Attend Meetups Read Books Contributing to the Apache Spark Project Appendix E. References for Part V Index About the Authors Colophon