Download The Site Reliability Workbook: Practical Ways to Implement SRE

Book details

The Site Reliability Workbook: Practical Ways to Implement SRE

Category: System administration
Edition:
Authors: Betsy Beyer et al. (eds.)
Series:
ISBN: 1492029505, 9781492029502
Publisher: O’Reilly
Year of publication: 2018
Number of pages: 508
Language: English
File format: PDF (converted to PDF, EPUB, or AZW3 on request)
File size: 14 MB

Price (Toman): 47,000

If you need The Site Reliability Workbook: Practical Ways to Implement SRE converted to PDF, EPUB, AZW3, MOBI, or DJVU, notify support and they will convert the file for you.

Note that The Site Reliability Workbook: Practical Ways to Implement SRE is the original-language edition and has not been translated into Persian. The International Library website offers original-language books only and does not provide any books translated into or written in Persian.

Description of The Site Reliability Workbook: Practical Ways to Implement SRE

In 2016, Google's Site Reliability Engineering book ignited an industry discussion on what it means to run production services today--and why reliability considerations are fundamental to service design. Now, Google engineers who worked on that bestseller introduce The Site Reliability Workbook, a hands-on companion that uses concrete examples to show you how to put SRE principles and practices to work in your environment.

This new workbook not only combines practical examples from Google's experiences, but also provides case studies from Google's Cloud Platform customers who underwent this journey. Evernote, The Home Depot, The New York Times, and other companies outline hard-won experiences of what worked for them and what didn't.

Dive into this workbook and learn how to flesh out your own SRE practice, no matter what size your company is.

You'll learn:

How to run reliable services in environments you don't completely control--like cloud
Practical applications of how to create, monitor, and run your services via Service Level Objectives
How to convert existing ops teams to SRE--including how to dig out of operational overload
Methods for starting SRE from either greenfield or brownfield

Table of contents

Foreword I. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
Foreword II. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix
Preface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxiii

Part I. Foundations

1. How SRE Relates to DevOps. . . . . . . . . . . . . . 1
Background on DevOps 2
No More Silos 2
Accidents Are Normal 3
Change Should Be Gradual 3
Tooling and Culture Are Interrelated 3
Measurement Is Crucial 4
Background on SRE 4
Operations Is a Software Problem 4
Manage by Service Level Objectives (SLOs) 5
Work to Minimize Toil 5
Automate This Year’s Job Away 6
Move Fast by Reducing the Cost of Failure 6
Share Ownership with Developers 6
Use the Same Tooling, Regardless of Function or Job Title 7
Compare and Contrast 7
Organizational Context and Fostering Successful Adoption 9
Narrow, Rigid Incentives Narrow Your Success 9
It’s Better to Fix It Yourself; Don’t Blame Someone Else 10
Consider Reliability Work as a Specialized Role 10
When Can Substitute for Whether 11
Strive for Parity of Esteem: Career and Financial 12

2. Implementing SLOs. . . . . . . . . . . . . . . . . . . 17
Why SREs Need SLOs 17
Getting Started 18
Reliability Targets and Error Budgets 19
What to Measure: Using SLIs 20
A Worked Example 23
Moving from SLI Specification to SLI Implementation 25
Measuring the SLIs 26
Using the SLIs to Calculate Starter SLOs 28
Choosing an Appropriate Time Window 29
Getting Stakeholder Agreement 30
Establishing an Error Budget Policy 31
Documenting the SLO and Error Budget Policy 32
Dashboards and Reports 33
Continuous Improvement of SLO Targets 34
Improving the Quality of Your SLO 35
Decision Making Using SLOs and Error Budgets 37
Advanced Topics 38
Modeling User Journeys 39
Grading Interaction Importance 39
Modeling Dependencies 40
Experimenting with Relaxing Your SLOs 41
Conclusion 42

3. SLO Engineering Case Studies. . . . . . . . . . 43
Evernote’s SLO Story 43
Why Did Evernote Adopt the SRE Model? 44
Introduction of SLOs: A Journey in Progress 45
Breaking Down the SLO Wall Between Customer and Cloud Provider 48
Current State 49
The Home Depot’s SLO Story 49
The SLO Culture Project 50
Our First Set of SLOs 52
Evangelizing SLOs 54
Automating VALET Data Collection 55
The Proliferation of SLOs 57
Applying VALET to Batch Applications 57
Using VALET in Testing 58
Future Aspirations 58
Summary 59
Conclusion 60

4. Monitoring. . . . . . . . . . . . . . . . . . . . . . . . . . 61
Desirable Features of a Monitoring Strategy 62
Speed 62
Calculations 62
Interfaces 63
Alerts 64
Sources of Monitoring Data 64
Examples 65
Managing Your Monitoring System 67
Treat Your Configuration as Code 67
Encourage Consistency 68
Prefer Loose Coupling 68
Metrics with Purpose 69
Intended Changes 70
Dependencies 70
Saturation 71
Status of Served Traffic 72
Implementing Purposeful Metrics 72
Testing Alerting Logic 72
Conclusion 73

5. Alerting on SLOs. . . . . . . . . . . . . . . . . . . . . 75
Alerting Considerations 75
Ways to Alert on Significant Events 76
1: Target Error Rate ≥ SLO Threshold 76
2: Increased Alert Window 78
3: Incrementing Alert Duration 79
4: Alert on Burn Rate 80
5: Multiple Burn Rate Alerts 82
6: Multiwindow, Multi-Burn-Rate Alerts 84
Low-Traffic Services and Error Budget Alerting 86
Generating Artificial Traffic 87
Combining Services 87
Making Service and Infrastructure Changes 87
Lowering the SLO or Increasing the Window 88
Extreme Availability Goals 89
Alerting at Scale 89
Conclusion 91

6. Eliminating Toil. . . . . . . . . . . . . . . . . . . . . . 93
What Is Toil? 94
Measuring Toil 96
Toil Taxonomy 98
Business Processes 98
Production Interrupts 99
Release Shepherding 99
Migrations 99
Cost Engineering and Capacity Planning 100
Troubleshooting for Opaque Architectures 100
Toil Management Strategies 101
Identify and Measure Toil 101
Engineer Toil Out of the System 101
Reject the Toil 101
Use SLOs to Reduce Toil 102
Start with Human-Backed Interfaces 102
Provide Self-Service Methods 102
Get Support from Management and Colleagues 103
Promote Toil Reduction as a Feature 103
Start Small and Then Improve 103
Increase Uniformity 103
Assess Risk Within Automation 104
Automate Toil Response 104
Use Open Source and Third-Party Tools 105
Use Feedback to Improve 105
Case Studies 106
Case Study 1: Reducing Toil in the Datacenter with Automation 107
Background 107
Problem Statement 110
What We Decided to Do 110
Design First Effort: Saturn Line-Card Repair 110
Implementation 111
Design Second Effort: Saturn Line-Card Repair Versus Jupiter Line-Card Repair 113
Implementation 114
Lessons Learned 118
Case Study 2: Decommissioning Filer-Backed Home Directories 121
Background 121
Problem Statement 121
What We Decided to Do 122
Design and Implementation 123
Key Components 124
Lessons Learned 127
Conclusion 129

7. Simplicity. . . . . . . . . . . . . . . . . . . . . . . . . . 131
Measuring Complexity 131
Simplicity Is End-to-End, and SREs Are Good for That 133
Case Study 1: End-to-End API Simplicity 134
Case Study 2: Project Lifecycle Complexity 134
Regaining Simplicity 135
Case Study 3: Simplification of the Display Ads Spiderweb 137
Case Study 4: Running Hundreds of Microservices on a Shared Platform 139
Case Study 5: pDNS No Longer Depends on Itself 140
Conclusion 141

Part II. Practices

8. On-Call. . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
Recap of “Being On-Call” Chapter of First SRE Book 148
Example On-Call Setups Within Google and Outside Google 149
Google: Forming a New Team 149
Evernote: Finding Our Feet in the Cloud 153
Practical Implementation Details 156
Anatomy of Pager Load 156
On-Call Flexibility 167
On-Call Team Dynamics 171
Conclusion 173

9. Incident Response. . . . . . . . . . . . . . . . . . . 175
Incident Management at Google 176
Incident Command System 176
Main Roles in Incident Response 177
Case Studies 177
Case Study 1: Software Bug—The Lights Are On but No One’s (Google) Home 177
Case Study 2: Service Fault—Cache Me If You Can 180
Case Study 3: Power Outage—Lightning Never Strikes Twice…Until It Does 185
Case Study 4: Incident Response at PagerDuty 188
Putting Best Practices into Practice 191
Incident Response Training 191
Prepare Beforehand 192
Drills 193
Conclusion 194

10. Postmortem Culture: Learning from Failure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
Case Study 196
Bad Postmortem 197
Why Is This Postmortem Bad? 199
Good Postmortem 203
Why Is This Postmortem Better? 212
Organizational Incentives 214
Model and Enforce Blameless Behavior 214
Reward Postmortem Outcomes 215
Share Postmortems Openly 217
Respond to Postmortem Culture Failures 218
Tools and Templates 220
Postmortem Templates 220
Postmortem Tooling 221
Conclusion 223

11. Managing Load. . . . . . . . . . . . . . . . . . . . . 225
Google Cloud Load Balancing 225
Anycast 226
Maglev 227
Global Software Load Balancer 229
Google Front End 229
GCLB: Low Latency 230
GCLB: High Availability 231
Case Study 1: Pokémon GO on GCLB 231
Autoscaling 236
Handling Unhealthy Machines 236
Working with Stateful Systems 237
Configuring Conservatively 237
Setting Constraints 2

12. Introducing Non-Abstract Large System Design. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
What Is NALSD? 245
Why “Non-Abstract”? 246
AdWords Example 246
Design Process 246
Initial Requirements 247
One Machine 248
Distributed System 251
Conclusion 260

13. Data Processing Pipelines. . . . . . . . . . . . 263
Pipeline Applications 264
Event Processing/Data Transformation to Order or Structure Data 264
Data Analytics 265
Machine Learning 265
Pipeline Best Practices 268
Define and Measure Service Level Objectives 268
Plan for Dependency Failure 270
Create and Maintain Pipeline Documentation 271
Map Your Development Lifecycle 272
Reduce Hotspotting and Workload Patterns 275
Implement Autoscaling and Resource Planning 276
Adhere to Access Control and Security Policies 277
Plan Escalation Paths 277
Pipeline Requirements and Design 277
What Features Do You Need? 278
Idempotent and Two-Phase Mutations 279
Checkpointing 279
Code Patterns 280
Pipeline Production Readiness 281
Pipeline Failures: Prevention and Response 284
Potential Failure Modes 284
Potential Causes 286
Case Study: Spotify 287
Event Delivery 288
Event Delivery System Design and Architecture 289
Event Delivery System Operation 290
Customer Integration and Support 293
Summary 298
Conclusion 299

14. Configuration Design and Best Practices. . . . . . 301
What Is Configuration? 301
Configuration and Reliability 302
Separating Philosophy and Mechanics 303
Configuration Philosophy 303
Configuration Asks Users Questions 305
Questions Should Be Close to User Goals 305
Mandatory and Optional Questions 306
Escaping Simplicity 308
Mechanics of Configuration 308
Separate Configuration and Resulting Data 308
Importance of Tooling 310
Ownership and Change Tracking 312
Safe Configuration Change Application 312
Conclusion 313

15. Configuration Specifics. . . . . . . . . . . . . . . 315
Configuration-Induced Toil 315
Reducing Configuration-Induced Toil 316
Critical Properties and Pitfalls of Configuration Systems 317
Pitfall 1: Failing to Recognize Configuration as a Programming Language Problem 317
Pitfall 2: Designing Accidental or Ad Hoc Language Features 318
Pitfall 3: Building Too Much Domain-Specific Optimization 318
Pitfall 4: Interleaving “Configuration Evaluation” with “Side Effects” 319
Pitfall 5: Using an Existing General-Purpose Scripting Language Like Python, Ruby, or Lua 319
Integrating a Configuration Language 320
Generating Config in Specific Formats 320
Driving Multiple Applications 321
Integrating an Existing Application: Kubernetes 322
What Kubernetes Provides 322
Example Kubernetes Config 322
Integrating the Configuration Language 323
Integrating Custom Applications (In-House Software) 326
Effectively Operating a Configuration System 329
Versioning 329
Source Control 330
Tooling 330
Testing 330
When to Evaluate Configuration 331
Very Early: Checking in the JSON 331
Middle of the Road: Evaluate at Build Time 332
Late: Evaluate at Runtime 332
Guarding Against Abusive Configuration 333
Conclusion 334

16. Canarying Releases. . . . . . . . . . . . . . . . . . 335
Release Engineering Principles 336
Balancing Release Velocity and Reliability 337
What Is Canarying? 338
Release Engineering and Canarying 338
Requirements of a Canary Process 339
Our Example Setup 339
A Roll Forward Deployment Versus a Simple Canary Deployment 340
Canary Implementation 342
Minimizing Risk to SLOs and the Error Budget 343
Choosing a Canary Population and Duration 343
Selecting and Evaluating Metrics 345
Metrics Should Indicate Problems 345
Metrics Should Be Representative and Attributable 346
Before/After Evaluation Is Risky 347
Use a Gradual Canary for Better Metric Selection 347
Dependencies and Isolation 348
Canarying in Noninteractive Systems 348
Requirements on Monitoring Data 349
Related Concepts 350
Blue/Green Deployment 350
Artificial Load Generation 350
Traffic Teeing 351
Conclusion 351

Part III. Processes

17. Identifying and Recovering from Overload. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
From Load to Overload 356
Case Study 1: Work Overload When Half a Team Leaves 358
Background 358
Problem Statement 358
What We Decided to Do 359
Implementation 359
Lessons Learned 360
Case Study 2: Perceived Overload After Organizational and Workload Changes 360
Background 360
Problem Statement 361
What We Decided to Do 362
Implementation 363
Effects 365
Lessons Learned 365
Strategies for Mitigating Overload 366
Recognizing the Symptoms of Overload 366
Reducing Overload and Restoring Team Health 367
Conclusion 369

18. SRE Engagement Model. . . . . . . . . . . . . . 371
The Service Lifecycle 372
Phase 1: Architecture and Design 372
Phase 2: Active Development 373
Phase 3: Limited Availability 373
Phase 4: General Availability 374
Phase 5: Deprecation 374
Phase 6: Abandoned 374
Phase 7: Unsupported 374
Setting Up the Relationship 375
Communicating Business and Production Priorities 375
Identifying Risks 375
Aligning Goals 375
Setting Ground Rules 379
Planning and Executing 379
Sustaining an Effective Ongoing Relationship 380
Investing Time in Working Better Together 380
Maintaining an Open Line of Communication 380
Performing Regular Service Reviews 381
Reassessing When Ground Rules Start to Slip 381
Adjusting Priorities According to Your SLOs and Error Budget 381
Handling Mistakes Appropriately 382
Scaling SRE to Larger Environments 382
Supporting Multiple Services with a Single SRE Team 382
Structuring a Multiple SRE Team Environment 383
Adapting SRE Team Structures to Changing Circumstances 384
Running Cohesive Distributed SRE Teams 384
Ending the Relationship 385
Case Study 1: Ares 385
Case Study 2: Data Analysis Pipeline 387
Conclusion 389

19. SRE: Reaching Beyond Your Walls. . . . . . 391
Truths We Hold to Be Self-Evident 391
Reliability Is the Most Important Feature 391
Your Users, Not Your Monitoring, Decide Your Reliability 392
If You Run a Platform, Then Reliability Is a Partnership 392
Everything Important Eventually Becomes a Platform 393
When Your Customers Have a Hard Time, You Have to Slow Down 393
You Will Need to Practice SRE with Your Customers 393
How to: SRE with Your Customers 394
Step 1: SLOs and SLIs Are How You Speak 394
Step 2: Audit the Monitoring and Build Shared Dashboards 395
Step 3: Measure and Renegotiate 396
Step 4: Design Reviews and Risk Analysis 396
Step 5: Practice, Practice, Practice 397
Be Thoughtful and Disciplined 397
Conclusion 398

20. SRE Team Lifecycles. . . . . . . . . . . . . . . . . 399
SRE Practices Without SREs 399
Starting an SRE Role 400
Finding Your First SRE 400
Placing Your First SRE 401
Bootstrapping Your First SRE 402
Distributed SREs 403
Your First SRE Team 403
Forming 404
Storming 405
Norming 408
Performing 411
Making More SRE Teams 413
Service Complexity 413
SRE Rollout 414
Geographical Splits 414
Suggested Practices for Running Many Teams 418
Mission Control 418
SRE Exchange 419
Training 419
Horizontal Projects 419
SRE Mobility 419
Travel 420
Launch Coordination Engineering Teams 420
Production Excellence 421
SRE Funding and Hiring 421
Conclusion 421

21. Organizational Change Management in SRE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
SRE Embraces Change 423
Introduction to Change Management 424
Lewin’s Three-Stage Model 424
McKinsey’s 7-S Model 424
Kotter’s Eight-Step Process for Leading Change 425
The Prosci ADKAR Model 425
Emotion-Based Models 426
The Deming Cycle 426
How These Theories Apply to SRE 427
Case Study 1: Scaling Waze—From Ad Hoc to Planned Change 427
Background 427
The Messaging Queue: Replacing a System While Maintaining Reliability 427
The Next Cycle of Change: Improving the Deployment Process 429
Lessons Learned 431
Case Study 2: Common Tooling Adoption in SRE 432
Background 432
Problem Statement 433
What We Decided to Do 434
Design 434
Implementation: Monitoring 436
Lessons Learned 436
Conclusion 439
Conclusion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441
A. Example SLO Document. . . . . . . . . . . . . . . 445
B. Example Error Budget Policy. . . . . . . . . . . 449
C. Results of Postmortem Analysis. . . . . . . . . 453
Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455