Edition: 1st
Authors: Saba Shah
Series:
ISBN: 9781804619780
Publisher: Packt Publishing
Publication year: 2024
Number of pages: 274
Language: English
File format: PDF (converted to PDF, EPUB, or AZW3 upon the user's request)
File size: 7 MB
If you would like the file of Databricks Certified Associate Developer for Apache Spark Using Python: The ultimate guide to getting certified converted to PDF, EPUB, AZW3, MOBI, or DJVU, notify support and the file will be converted to the requested format.
Note that Databricks Certified Associate Developer for Apache Spark Using Python: The ultimate guide to getting certified is the original English-language edition, not a Persian translation. The International Library website offers only original-language books and does not provide any books translated into or written in Persian.
Building Modern Data Applications Using Databricks Lakehouse
Contributors
About the author
About the reviewer
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Conventions used
Get in touch
Share Your Thoughts
Download a free PDF copy of this book
Part 1: Near-Real-Time Data Pipelines for the Lakehouse
Chapter 1: An Introduction to Delta Live Tables
Technical requirements
The emergence of the lakehouse
The Lambda architectural pattern
Introducing the medallion architecture
The Databricks lakehouse
The maintenance predicament of a streaming application
What is the DLT framework?
How is DLT related to Delta Lake?
Introducing DLT concepts
Streaming tables
Materialized views
Views
Pipeline
Pipeline triggers
Workflow
Types of Databricks compute
Databricks Runtime
Unity Catalog
A quick Delta Lake primer
The architecture of a Delta table
The contents of a transaction commit
Supporting concurrent table reads and writes
Tombstoned data files
Calculating Delta table state
Time travel
Tracking table changes using change data feed
A hands-on example – creating your first Delta Live Tables pipeline
Summary
Chapter 2: Applying Data Transformations Using Delta Live Tables
Technical requirements
Ingesting data from input sources
Ingesting data using Databricks Auto Loader
Scalability challenge in structured streaming
Using Auto Loader with DLT
Applying changes to downstream tables
APPLY CHANGES command
The DLT reconciliation process
Publishing datasets to Unity Catalog
Why store datasets in Unity Catalog?
Creating a new catalog
Assigning catalog permissions
Data pipeline settings
The DLT product edition
Pipeline execution mode
Databricks runtime
Pipeline cluster types
Serverless compute versus traditional compute
Loading external dependencies
Data pipeline processing modes
Hands-on exercise – applying SCD Type 2 changes
Summary
Chapter 3: Managing Data Quality Using Delta Live Tables
Technical requirements
Defining data constraints in Delta Lake
Using temporary datasets to validate data processing
An introduction to expectations
Expectation composition
Hands-on exercise – writing your first data quality expectation
Acting on failed expectations
Hands-on example – failing a pipeline run due to poor data quality
Applying multiple data quality expectations
Decoupling expectations from a DLT pipeline
Hands-on exercise – quarantining bad data for correction
Summary
Chapter 4: Scaling DLT Pipelines
Technical requirements
Scaling compute to handle demand
Hands-on example – setting autoscaling properties using the Databricks REST API
Automated table maintenance tasks
Why auto compaction is important
Vacuuming obsolete table files
Moving compute closer to the data
Optimizing table layouts for faster table updates
Rewriting table files during updates
Data skipping using table partitioning
Delta Lake Z-ordering on MERGE columns
Improving write performance using deletion vectors
Serverless DLT pipelines
Introducing Enzyme, a performance optimization layer
Summary
Part 2: Securing the Lakehouse Using the Unity Catalog
Chapter 5: Mastering Data Governance in the Lakehouse with Unity Catalog
Technical requirements
Understanding data governance in a lakehouse
Introducing the Databricks Unity Catalog
A problem worth solving
An overview of the Unity Catalog architecture
Unity Catalog-enabled cluster types
Unity Catalog object model
Enabling Unity Catalog on an existing Databricks workspace
Identity federation in Unity Catalog
Data discovery and cataloging
Tracking dataset relationships using lineage
Observability with system tables
Tracing the lineage of other assets
Fine-grained data access
Hands-on example – data masking healthcare datasets
Summary
Chapter 6: Managing Data Locations in Unity Catalog
Technical requirements
Creating and managing data catalogs in Unity Catalog
Managed data versus external data
Saving data to storage volumes in Unity Catalog
Setting default locations for data within Unity Catalog
Isolating catalogs to specific workspaces
Creating and managing external storage locations in Unity Catalog
Storing cloud service authentication using storage credentials
Querying external systems using Lakehouse Federation
Hands-on lab – extracting document text for a generative AI pipeline
Generating mock documents
Defining helper functions
Choosing a file format randomly
Creating/assembling the DLT pipeline
Summary
Chapter 7: Viewing Data Lineage Using Unity Catalog
Technical requirements
Introducing data lineage in Unity Catalog
Tracing data origins using the Data Lineage REST API
Visualizing upstream and downstream transformations
Identifying dependencies and impacts
Hands-on lab – documenting data lineage across an organization
Summary
Part 3: Continuous Integration, Continuous Deployment, and Continuous Monitoring
Chapter 8: Deploying, Maintaining, and Administrating DLT Pipelines Using Terraform
Technical requirements
Introducing the Databricks provider for Terraform
Setting up a local Terraform environment
Importing the Databricks Terraform provider
Configuring workspace authentication
Defining a DLT pipeline source notebook
Applying workspace changes
Configuring DLT pipelines using Terraform
name
notification
channel
development
continuous
edition
photon
configuration
library
cluster
catalog
target
storage
Automating DLT pipeline deployment
Hands-on exercise – deploying a DLT pipeline using VS Code
Setting up VS Code
Creating a new Terraform project
Defining the Terraform resources
Deploying the Terraform project
Summary
Chapter 9: Leveraging Databricks Asset Bundles to Streamline Data Pipeline Deployment
Technical requirements
Introduction to Databricks Asset Bundles
Elements of a DAB configuration file
Specifying a deployment mode
Databricks Asset Bundles in action
User-to-machine authentication
Machine-to-machine authentication
Initializing an asset bundle using templates
Hands-on exercise – deploying your first DAB
Hands-on exercise – simplifying cross-team collaboration with GitHub Actions
Setting up the environment
Configuring the GitHub Action
Testing the workflow
Versioning and maintenance
Summary
Chapter 10: Monitoring Data Pipelines in Production
Technical requirements
Introduction to data pipeline monitoring
Exploring ways to monitor data pipelines
Using DBSQL Alerts to notify about data validity
Pipeline health and performance monitoring
Hands-on exercise – querying data quality events for a dataset
Data quality monitoring
Introducing Lakehouse Monitoring
Hands-on exercise – creating a lakehouse monitor
Best practices for production failure resolution
Handling pipeline update failures
Recovering from table transaction failure
Hands-on exercise – setting up a webhook alert when a job runs longer than expected
Summary
Index
Why subscribe?
Other Books You May Enjoy
Packt is searching for authors like you
Share Your Thoughts
Download a free PDF copy of this book