Title: The Site Reliability Workbook: Practical Ways to Implement SRE
Category: System Administration
Authors: Betsy Beyer et al. (eds.)
ISBN: 1492029505, 9781492029502
Publisher: O’Reilly
Year: 2018
Pages: 508
Language: English
File format: PDF (converted to PDF, EPUB, or AZW3 on request)
File size: 14 MB
In 2016, Google's Site Reliability Engineering book ignited an industry discussion on what it means to run production services today--and why reliability considerations are fundamental to service design. Now, Google engineers who worked on that bestseller introduce The Site Reliability Workbook, a hands-on companion that uses concrete examples to show you how to put SRE principles and practices to work in your environment.
This new workbook not only combines practical examples from Google's experiences, but also provides case studies from Google's Cloud Platform customers who underwent this journey. Evernote, The Home Depot, The New York Times, and other companies outline hard-won experiences of what worked for them and what didn't.
Dive into this workbook and learn how to flesh out your own SRE practice, no matter what size your company is.
You'll learn:
Foreword I
Foreword II
Preface

1. How SRE Relates to DevOps
  Background on DevOps
    No More Silos
    Accidents Are Normal
    Change Should Be Gradual
    Tooling and Culture Are Interrelated
    Measurement Is Crucial
  Background on SRE
    Operations Is a Software Problem
    Manage by Service Level Objectives (SLOs)
    Work to Minimize Toil
    Automate This Year’s Job Away
    Move Fast by Reducing the Cost of Failure
    Share Ownership with Developers
    Use the Same Tooling, Regardless of Function or Job Title
  Compare and Contrast
  Organizational Context and Fostering Successful Adoption
    Narrow, Rigid Incentives Narrow Your Success
    It’s Better to Fix It Yourself; Don’t Blame Someone Else
    Consider Reliability Work as a Specialized Role
    When Can Substitute for Whether
    Strive for Parity of Esteem: Career and Financial

2. Implementing SLOs
  Why SREs Need SLOs
  Getting Started
    Reliability Targets and Error Budgets
    What to Measure: Using SLIs
  A Worked Example
    Moving from SLI Specification to SLI Implementation
    Measuring the SLIs
    Using the SLIs to Calculate Starter SLOs
  Choosing an Appropriate Time Window
  Getting Stakeholder Agreement
    Establishing an Error Budget Policy
    Documenting the SLO and Error Budget Policy
    Dashboards and Reports
  Continuous Improvement of SLO Targets
    Improving the Quality of Your SLO
  Decision Making Using SLOs and Error Budgets
  Advanced Topics
    Modeling User Journeys
    Grading Interaction Importance
    Modeling Dependencies
    Experimenting with Relaxing Your SLOs
  Conclusion

3. SLO Engineering Case Studies
  Evernote’s SLO Story
    Why Did Evernote Adopt the SRE Model?
    Introduction of SLOs: A Journey in Progress
    Breaking Down the SLO Wall Between Customer and Cloud Provider
    Current State
  The Home Depot’s SLO Story
    The SLO Culture Project
    Our First Set of SLOs
    Evangelizing SLOs
    Automating VALET Data Collection
    The Proliferation of SLOs
    Applying VALET to Batch Applications
    Using VALET in Testing
    Future Aspirations
    Summary
  Conclusion

4. Monitoring
  Desirable Features of a Monitoring Strategy
    Speed
    Calculations
    Interfaces
    Alerts
  Sources of Monitoring Data
    Examples
  Managing Your Monitoring System
    Treat Your Configuration as Code
    Encourage Consistency
    Prefer Loose Coupling
  Metrics with Purpose
    Intended Changes
    Dependencies
    Saturation
    Status of Served Traffic
    Implementing Purposeful Metrics
  Testing Alerting Logic
  Conclusion

5. Alerting on SLOs
  Alerting Considerations
  Ways to Alert on Significant Events
    1: Target Error Rate ≥ SLO Threshold
    2: Increased Alert Window
    3: Incrementing Alert Duration
    4: Alert on Burn Rate
    5: Multiple Burn Rate Alerts
    6: Multiwindow, Multi-Burn-Rate Alerts
  Low-Traffic Services and Error Budget Alerting
    Generating Artificial Traffic
    Combining Services
    Making Service and Infrastructure Changes
    Lowering the SLO or Increasing the Window
  Extreme Availability Goals
  Alerting at Scale
  Conclusion

6. Eliminating Toil
  What Is Toil?
  Measuring Toil
  Toil Taxonomy
    Business Processes
    Production Interrupts
    Release Shepherding
    Migrations
    Cost Engineering and Capacity Planning
    Troubleshooting for Opaque Architectures
  Toil Management Strategies
    Identify and Measure Toil
    Engineer Toil Out of the System
    Reject the Toil
    Use SLOs to Reduce Toil
    Start with Human-Backed Interfaces
    Provide Self-Service Methods
    Get Support from Management and Colleagues
    Promote Toil Reduction as a Feature
    Start Small and Then Improve
    Increase Uniformity
    Assess Risk Within Automation
    Automate Toil Response
    Use Open Source and Third-Party Tools
    Use Feedback to Improve
  Case Studies
  Case Study 1: Reducing Toil in the Datacenter with Automation
    Background
    Problem Statement
    What We Decided to Do
    Design First Effort: Saturn Line-Card Repair
    Implementation
    Design Second Effort: Saturn Line-Card Repair Versus Jupiter Line-Card Repair
    Implementation
    Lessons Learned
  Case Study 2: Decommissioning Filer-Backed Home Directories
    Background
    Problem Statement
    What We Decided to Do
    Design and Implementation
    Key Components
    Lessons Learned
  Conclusion

7. Simplicity
  Measuring Complexity
  Simplicity Is End-to-End, and SREs Are Good for That
    Case Study 1: End-to-End API Simplicity
    Case Study 2: Project Lifecycle Complexity
  Regaining Simplicity
    Case Study 3: Simplification of the Display Ads Spiderweb
    Case Study 4: Running Hundreds of Microservices on a Shared Platform
    Case Study 5: pDNS No Longer Depends on Itself
  Conclusion

Part II. Practices

8. On-Call
  Recap of “Being On-Call” Chapter of First SRE Book
  Example On-Call Setups Within Google and Outside Google
    Google: Forming a New Team
    Evernote: Finding Our Feet in the Cloud
  Practical Implementation Details
    Anatomy of Pager Load
    On-Call Flexibility
    On-Call Team Dynamics
  Conclusion

9. Incident Response
  Incident Management at Google
    Incident Command System
    Main Roles in Incident Response
  Case Studies
    Case Study 1: Software Bug—The Lights Are On but No One’s (Google) Home
    Case Study 2: Service Fault—Cache Me If You Can
    Case Study 3: Power Outage—Lightning Never Strikes Twice… Until It Does
    Case Study 4: Incident Response at PagerDuty
  Putting Best Practices into Practice
    Incident Response Training
    Prepare Beforehand
    Drills
  Conclusion

10. Postmortem Culture: Learning from Failure
  Case Study
  Bad Postmortem
    Why Is This Postmortem Bad?
  Good Postmortem
    Why Is This Postmortem Better?
  Organizational Incentives
    Model and Enforce Blameless Behavior
    Reward Postmortem Outcomes
    Share Postmortems Openly
    Respond to Postmortem Culture Failures
  Tools and Templates
    Postmortem Templates
    Postmortem Tooling
  Conclusion

11. Managing Load
  Google Cloud Load Balancing
    Anycast
    Maglev
    Global Software Load Balancer
    Google Front End
    GCLB: Low Latency
    GCLB: High Availability
    Case Study 1: Pokémon GO on GCLB
  Autoscaling
    Handling Unhealthy Machines
    Working with Stateful Systems
    Configuring Conservatively
    Setting Constraints

12. Introducing Non-Abstract Large System Design
  What Is NALSD?
  Why “Non-Abstract”?
  AdWords Example
    Design Process
    Initial Requirements
    One Machine
    Distributed System
  Conclusion

13. Data Processing Pipelines
  Pipeline Applications
    Event Processing/Data Transformation to Order or Structure Data
    Data Analytics
    Machine Learning
  Pipeline Best Practices
    Define and Measure Service Level Objectives
    Plan for Dependency Failure
    Create and Maintain Pipeline Documentation
    Map Your Development Lifecycle
    Reduce Hotspotting and Workload Patterns
    Implement Autoscaling and Resource Planning
    Adhere to Access Control and Security Policies
    Plan Escalation Paths
  Pipeline Requirements and Design
    What Features Do You Need?
    Idempotent and Two-Phase Mutations
    Checkpointing
    Code Patterns
    Pipeline Production Readiness
  Pipeline Failures: Prevention and Response
    Potential Failure Modes
    Potential Causes
  Case Study: Spotify
    Event Delivery
    Event Delivery System Design and Architecture
    Event Delivery System Operation
    Customer Integration and Support
    Summary
  Conclusion

14. Configuration Design and Best Practices
  What Is Configuration?
    Configuration and Reliability
    Separating Philosophy and Mechanics
  Configuration Philosophy
    Configuration Asks Users Questions
    Questions Should Be Close to User Goals
    Mandatory and Optional Questions
    Escaping Simplicity
  Mechanics of Configuration
    Separate Configuration and Resulting Data
    Importance of Tooling
    Ownership and Change Tracking
    Safe Configuration Change Application
  Conclusion

15. Configuration Specifics
  Configuration-Induced Toil
  Reducing Configuration-Induced Toil
  Critical Properties and Pitfalls of Configuration Systems
    Pitfall 1: Failing to Recognize Configuration as a Programming Language Problem
    Pitfall 2: Designing Accidental or Ad Hoc Language Features
    Pitfall 3: Building Too Much Domain-Specific Optimization
    Pitfall 4: Interleaving “Configuration Evaluation” with “Side Effects”
    Pitfall 5: Using an Existing General-Purpose Scripting Language Like Python, Ruby, or Lua
  Integrating a Configuration Language
    Generating Config in Specific Formats
    Driving Multiple Applications
  Integrating an Existing Application: Kubernetes
    What Kubernetes Provides
    Example Kubernetes Config
    Integrating the Configuration Language
  Integrating Custom Applications (In-House Software)
  Effectively Operating a Configuration System
    Versioning
    Source Control
    Tooling
    Testing
  When to Evaluate Configuration
    Very Early: Checking in the JSON
    Middle of the Road: Evaluate at Build Time
    Late: Evaluate at Runtime
  Guarding Against Abusive Configuration
  Conclusion

16. Canarying Releases
  Release Engineering Principles
    Balancing Release Velocity and Reliability
  What Is Canarying?
  Release Engineering and Canarying
    Requirements of a Canary Process
    Our Example Setup
  A Roll Forward Deployment Versus a Simple Canary Deployment
  Canary Implementation
    Minimizing Risk to SLOs and the Error Budget
    Choosing a Canary Population and Duration
  Selecting and Evaluating Metrics
    Metrics Should Indicate Problems
    Metrics Should Be Representative and Attributable
    Before/After Evaluation Is Risky
    Use a Gradual Canary for Better Metric Selection
  Dependencies and Isolation
  Canarying in Noninteractive Systems
  Requirements on Monitoring Data
  Related Concepts
    Blue/Green Deployment
    Artificial Load Generation
    Traffic Teeing
  Conclusion

Part III. Processes

17. Identifying and Recovering from Overload
  From Load to Overload
  Case Study 1: Work Overload When Half a Team Leaves
    Background
    Problem Statement
    What We Decided to Do
    Implementation
    Lessons Learned
  Case Study 2: Perceived Overload After Organizational and Workload Changes
    Background
    Problem Statement
    What We Decided to Do
    Implementation
    Effects
    Lessons Learned
  Strategies for Mitigating Overload
    Recognizing the Symptoms of Overload
    Reducing Overload and Restoring Team Health
  Conclusion

18. SRE Engagement Model
  The Service Lifecycle
    Phase 1: Architecture and Design
    Phase 2: Active Development
    Phase 3: Limited Availability
    Phase 4: General Availability
    Phase 5: Deprecation
    Phase 6: Abandoned
    Phase 7: Unsupported
  Setting Up the Relationship
    Communicating Business and Production Priorities
    Identifying Risks
    Aligning Goals
    Setting Ground Rules
    Planning and Executing
  Sustaining an Effective Ongoing Relationship
    Investing Time in Working Better Together
    Maintaining an Open Line of Communication
    Performing Regular Service Reviews
    Reassessing When Ground Rules Start to Slip
    Adjusting Priorities According to Your SLOs and Error Budget
    Handling Mistakes Appropriately
  Scaling SRE to Larger Environments
    Supporting Multiple Services with a Single SRE Team
    Structuring a Multiple SRE Team Environment
    Adapting SRE Team Structures to Changing Circumstances
    Running Cohesive Distributed SRE Teams
  Ending the Relationship
    Case Study 1: Ares
    Case Study 2: Data Analysis Pipeline
  Conclusion

19. SRE: Reaching Beyond Your Walls
  Truths We Hold to Be Self-Evident
    Reliability Is the Most Important Feature
    Your Users, Not Your Monitoring, Decide Your Reliability
    If You Run a Platform, Then Reliability Is a Partnership
    Everything Important Eventually Becomes a Platform
    When Your Customers Have a Hard Time, You Have to Slow Down
    You Will Need to Practice SRE with Your Customers
  How to: SRE with Your Customers
    Step 1: SLOs and SLIs Are How You Speak
    Step 2: Audit the Monitoring and Build Shared Dashboards
    Step 3: Measure and Renegotiate
    Step 4: Design Reviews and Risk Analysis
    Step 5: Practice, Practice, Practice
    Be Thoughtful and Disciplined
  Conclusion

20. SRE Team Lifecycles
  SRE Practices Without SREs
  Starting an SRE Role
    Finding Your First SRE
    Placing Your First SRE
    Bootstrapping Your First SRE
    Distributed SREs
  Your First SRE Team
    Forming
    Storming
    Norming
    Performing
  Making More SRE Teams
    Service Complexity
    SRE Rollout
    Geographical Splits
  Suggested Practices for Running Many Teams
    Mission Control
    SRE Exchange
    Training
    Horizontal Projects
    SRE Mobility
    Travel
    Launch Coordination Engineering Teams
    Production Excellence
    SRE Funding and Hiring
  Conclusion

21. Organizational Change Management in SRE
  SRE Embraces Change
  Introduction to Change Management
    Lewin’s Three-Stage Model
    McKinsey’s 7-S Model
    Kotter’s Eight-Step Process for Leading Change
    The Prosci ADKAR Model
    Emotion-Based Models
    The Deming Cycle
    How These Theories Apply to SRE
  Case Study 1: Scaling Waze—From Ad Hoc to Planned Change
    Background
    The Messaging Queue: Replacing a System While Maintaining Reliability
    The Next Cycle of Change: Improving the Deployment Process
    Lessons Learned
  Case Study 2: Common Tooling Adoption in SRE
    Background
    Problem Statement
    What We Decided to Do
    Design
    Implementation: Monitoring
    Lessons Learned
  Conclusion

Conclusion
A. Example SLO Document
B. Example Error Budget Policy
C. Results of Postmortem Analysis
Index