Monitoring and telemetry - Part 2
In the previous post, we looked at the features of the out-of-the-box telemetry. This post focuses on understanding the requirements before configuring telemetry.
Why gather telemetry requirements?
Defining the correct set of requirements is necessary so that the solution is aligned with the team's objectives for maintaining the health of the applications.
Understand the Business Objective:
Identify and prioritise an overarching objective so that the monitoring solution is designed to support it.
Reduce downtime and improve system stability
Optimise performance
Gain visibility into key business processes
Monitor compliance with SLAs
Identify the key stakeholders:
The monitoring solution should benefit its stakeholders, so identify them before designing the solution.
IT Operations team
Development team
Product managers
Customer support team
Leadership
Questions to ask stakeholders:
What is the crucial information for your role?
What are the pain points in your process?
How do you plan to use the telemetry solution in your process?
Identify the use cases:
In addition to these questions, insight into concrete use cases ensures the monitoring solution meets real-world needs.
Monitor a particular API success or failure rate
Alert if the response time increases
Track customer interactions
Measure resource consumption for cost monitoring
Monitor P50, P95 and P99 for purchase order (PO) processing. These are percentiles of the processing time, measured for example from purchase requisition (PR) submission through approval to payment. P50 is the median: half of the POs are processed within that time. P95 is the time within which 95% are processed. P99 is the time within which 99% of requests are completed: if the P99 latency is 100 ms, 99% of requests were served within 100 milliseconds and 1% took longer.
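To make the percentile and success-rate use cases concrete, here is a minimal Python sketch. It assumes the nearest-rank definition of a percentile (telemetry tools may interpolate differently), and the sample durations and status codes are invented purely for illustration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def success_rate(status_codes):
    """Fraction of calls that returned a 2xx HTTP status."""
    if not status_codes:
        raise ValueError("status_codes must not be empty")
    ok = sum(1 for code in status_codes if 200 <= code < 300)
    return ok / len(status_codes)


# Hypothetical PO processing times in milliseconds (illustrative data only).
durations_ms = [80, 85, 90, 92, 95, 97, 99, 100, 250, 900]

for p in (50, 95, 99):
    print(f"P{p}: {percentile(durations_ms, p)} ms")

print(f"Success rate: {success_rate([200, 201, 500, 404]):.0%}")
```

With these samples the median is 95 ms while P95 and P99 are both 900 ms, which shows why tail percentiles expose slow outliers that an average or median would hide.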
Determine metrics and KPIs:
Identify and describe the essential metrics and key performance indicators (KPIs) that offer visibility into the overall health and efficiency of your application.
Performance Metrics – Include response time, throughput, and latency, which reflect the speed and responsiveness of the system.
Error Metrics – Comprise exception rates and failed transaction counts, indicating system stability and reliability.
Business Metrics – Cover conversion rates and revenue impact, linking system performance to business outcomes.
Infrastructure Metrics – Encompass CPU usage, memory consumption, and disk I/O, highlighting the state of the underlying technical environment.
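As a rough illustration of how performance and error metrics could be derived from raw request data, the sketch below summarises one observation window. The `RequestRecord` type, the `summarize` function and the field names are hypothetical, not part of any specific telemetry product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestRecord:
    """One observed request (hypothetical shape, for illustration)."""
    duration_ms: float
    failed: bool


def summarize(records, window_seconds):
    """Derive basic performance and error metrics for one time window."""
    if not records:
        raise ValueError("records must not be empty")
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    total = len(records)
    failures = sum(1 for r in records if r.failed)
    return {
        "throughput_per_s": total / window_seconds,   # performance metric
        "avg_response_ms": sum(r.duration_ms for r in records) / total,
        "error_rate": failures / total,               # error metric
    }


# Four invented requests observed during a 2-second window.
records = [
    RequestRecord(100, False),
    RequestRecord(300, True),
    RequestRecord(200, False),
    RequestRecord(200, False),
]
print(summarize(records, window_seconds=2))
```

Business and infrastructure metrics usually come from other sources (order systems, platform counters), so they are left out of this sketch.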
Alert notification:
Determine the scenarios that should trigger alerts and establish a clear approach for managing notifications effectively.
Best Practices for Alerting:
Define Thresholds Carefully: Set alert thresholds that strike a balance between being responsive and avoiding noise from false positives.
Apply Severity Levels: Categorize alerts by importance to ensure that critical issues receive immediate attention.
Establish Escalation Procedures: Create escalation paths for alerts that remain unresolved, ensuring timely resolution.
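These three practices can be sketched in a few lines of Python. The rule below only fires when several consecutive samples breach a threshold (one common way to reduce noise), maps the breach to a severity level, and flags an alert for escalation once it has been unresolved for too long. The `AlertRule` fields and all threshold values are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical alert rule; sustained_samples is assumed to be >= 1."""
    metric: str
    warning_at: float
    critical_at: float
    sustained_samples: int   # consecutive breaching samples required
    escalate_after_min: int  # minutes unresolved before escalating


def classify(rule: AlertRule, recent_values: Sequence[float]) -> Optional[str]:
    """Return 'critical', 'warning', or None for the latest samples."""
    window = recent_values[-rule.sustained_samples:]
    if len(window) < rule.sustained_samples:
        return None  # not enough data to decide yet
    # Every sample in the window must breach, so compare the lowest one.
    lowest = min(window)
    if lowest >= rule.critical_at:
        return "critical"
    if lowest >= rule.warning_at:
        return "warning"
    return None


def should_escalate(rule: AlertRule, severity: Optional[str],
                    minutes_unresolved: int) -> bool:
    """Escalate any active alert that stays unresolved too long."""
    return severity is not None and minutes_unresolved >= rule.escalate_after_min


rule = AlertRule("response_time_ms", warning_at=500, critical_at=1000,
                 sustained_samples=3, escalate_after_min=15)

print(classify(rule, [400, 1200, 450]))    # single spike is ignored
print(classify(rule, [600, 1200, 700]))    # sustained breach of the warning level
print(classify(rule, [1100, 1200, 1500]))  # sustained breach of the critical level
```

Requiring a sustained breach trades a little detection delay for fewer false positives; the right window size depends on how quickly the team needs to react.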
Notification Channels:
Email
Text Messages (SMS)
Collaboration Tools: Integrations with Microsoft Teams, Slack, or similar platforms for real-time communication.
Data retention and privacy:
Establish guidelines for how long telemetry data should be retained, ensuring alignment with data privacy regulations and organizational policies.
Retention Periods:
Set retention timelines based on business requirements, compliance obligations, and audit needs.
Factor in storage costs and define an archiving strategy to manage long-term data efficiently.
Implement policies for data purging or archiving to optimize system performance and cost-effectiveness.
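A retention requirement can be captured as a small, testable rule. The sketch below assumes a two-tier policy (keep data "hot" for a while, then archive, then purge); the `RetentionPolicy` type and the 30-day and 365-day values are placeholders, not recommendations.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class RetentionPolicy:
    """Hypothetical two-tier policy: hot storage, then archive, then purge."""
    hot_days: int
    archive_days: int


def retention_action(policy: RetentionPolicy, recorded_at: datetime,
                     now: datetime) -> str:
    """Decide what to do with a telemetry record based on its age."""
    age = now - recorded_at
    if age <= timedelta(days=policy.hot_days):
        return "keep"
    if age <= timedelta(days=policy.archive_days):
        return "archive"
    return "purge"


policy = RetentionPolicy(hot_days=30, archive_days=365)
now = datetime(2025, 1, 1, tzinfo=timezone.utc)

for age_days in (10, 90, 400):
    recorded_at = now - timedelta(days=age_days)
    print(age_days, "days old ->", retention_action(policy, recorded_at, now))
```

Writing the policy down in this form makes it easy to review with compliance and finance stakeholders before it is configured in the actual telemetry platform.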
Privacy Considerations:
Verify that any personally identifiable information (PII) is excluded from telemetry data or anonymized in accordance with privacy standards.
Conduct regular audits of data collection and storage practices to ensure regulatory compliance (e.g., GDPR).
Enforce strict access controls and data governance policies to protect sensitive information.
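One way to keep PII out of telemetry is to scrub events before they leave the application. The sketch below drops fields with known sensitive names and redacts e-mail addresses found in free-text values. The key names and the simple e-mail pattern are assumptions for illustration; a real implementation needs a reviewed field inventory and will not catch every form of PII.

```python
import re

# Simplified e-mail pattern; an assumption, not a complete validator.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical field names that must never reach the telemetry store.
SENSITIVE_KEYS = {"email", "phone", "customer_name"}


def scrub(event: dict) -> dict:
    """Return a copy of the event without sensitive fields and with
    e-mail addresses redacted from string values."""
    cleaned = {}
    for key, value in event.items():
        if key.lower() in SENSITIVE_KEYS:
            continue  # drop the whole field
        if isinstance(value, str):
            value = EMAIL_PATTERN.sub("[redacted-email]", value)
        cleaned[key] = value
    return cleaned


event = {
    "email": "jane.doe@example.com",
    "message": "Confirmation sent to jane.doe@example.com",
    "duration_ms": 120,
}
print(scrub(event))
```

Scrubbing at the source complements, but does not replace, the audits and access controls listed above.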
Visual reports:
Establish how telemetry data should be visualized to deliver meaningful, actionable insights.
Dashboards: Design role-based dashboards tailored to different audiences—such as operational teams and business stakeholders—for real-time monitoring and decision-making.
Reports: Outline requirements for scheduled reporting, such as weekly SLA compliance summaries or performance trend analyses.
Custom Queries: Leverage tools like Azure Data Explorer to run interactive, ad hoc queries that enable deep analysis and investigation of telemetry data.
Adhere to the organization's governance policies:
Telemetry Data Access Management: Implement strict access controls to ensure that only authorized users can view or manage telemetry data.
Monitoring Configuration Audits: Maintain detailed audit logs to track changes made to monitoring and alerting setups for accountability and compliance.
Cost Control and Budget Oversight: Monitor and manage spending on telemetry and monitoring solutions to stay within budget and avoid unexpected expenses.
Documentation:
Document all the requirements gathered above so they are at hand when designing the monitoring solution.
References:
https://learn.microsoft.com/en-us/azure/architecture/best-practices/monitoring