At a fintech company, product and service reliability is critical: Discover® Financial Services customers must be able to access their money, credit, and accounts whenever they need to. Like many other companies, Discover has adopted a product-centric approach in which responsibility for reliability sits with the product teams.
My team at Discover Financial Services is responsible for digital payments. To create a culture where product teams own the reliability of their offerings, the Digital Payments team made seven instrumental changes:
- Define, measure, and reveal product goals: We broke down key reliability goals and extended them to each product area. Each product team receives a scorecard with metrics covering risks, vulnerabilities, observability, and automation levels.
- Map product flows: Product flows can get out of sync over time, so before we started our “ops-in-squads” model we took time to evaluate our product flows and identify gaps in reliability engineering. That preparation helped us be most effective. Going through the exercise with teams gave them a more in-depth understanding of the flows, which led to more success.
- Ensure chaos testing: Implementing chaos testing and game-day simulations helped us spot and fix observability problems and correct logging practices that were creating noise and masking true alerts. These activities created a great deal of energy and awareness around availability.
- Clearly define service level objectives: SREs worked with product teams to define service level objectives (SLOs) for services and to build the observability needed to verify that the SLOs are met. Teams are responsible for ensuring proper logging coverage, establishing monitoring practices, and creating alerts.
- Create Ops-Readiness Guidelines: Because each product team is responsible for the reliability of its products, we created a list of guidelines that clearly articulate expectations for logging, service monitoring, alerting, deployment, transaction retries, recoverability plans, automated production validation, certificate rotation plans, chaos testing, and more.
- Plan for capacity management: One of the more critical changes we made was to plan better for capacity management and performance testing through quarterly capacity management reviews. These reviews bring the right people together to determine projected average and peak traffic patterns and volumes 6 to 12 months in advance for each customer. This helps ensure that systems are ready for any peak times.
- Make teams directly responsible for product support: We moved from a centralized support system to making teams accountable for supporting their own products. Having to take late-night calls about product issues gives a team a stronger focus on designing for reliability.
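
To make the SLO item above concrete, here is a minimal sketch of how an availability SLO and its error budget might be evaluated over a reporting window. It is only an illustration: the `evaluate_slo` function, the `SloReport` type, and the 99.9% target used in the note below are hypothetical and are not taken from Discover's systems.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    availability: float      # fraction of requests that succeeded
    budget_remaining: float  # fraction of the error budget left (negative if overspent)
    slo_met: bool            # True when availability is at or above the target


def evaluate_slo(total_requests: int, failed_requests: int, target: float) -> SloReport:
    """Compare observed availability against an SLO target (hypothetical helper)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")

    availability = (total_requests - failed_requests) / total_requests
    # The error budget is the number of failures the target tolerates in this window.
    allowed_failures = (1 - target) * total_requests
    budget_remaining = 1 - failed_requests / allowed_failures
    return SloReport(availability, budget_remaining, availability >= target)
```

For example, calling `evaluate_slo(1_000_000, 400, 0.999)` describes a window in which 400 of one million requests failed against an assumed 99.9% target. A negative `budget_remaining` would signal that the window's error budget has been overspent.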
These seven changes were critical in helping the Digital Payments team create a culture where each product team owned the reliability of their products. This has resulted in teams that feel more empowered to enable reliability and in products that are more secure and compliant.