Troubleshooting
At blocktorch, we aim to be a reliable partner to our users throughout the whole software development lifecycle and to equip engineers with the right tooling and data insights for troubleshooting. Below is an outline of steps you can take during the troubleshooting process and how blocktorch's toolkit can help at each one. Our mission is to help make web3 more reliable and secure for everyone involved.
1. Initial Assessment
Identify Symptoms: Determine whether the issue is downtime or a security breach.
Building the relevant monitors in blocktorch can be crucial for identifying symptoms as quickly as possible. Blocktorch ships some KPIs and visuals out of the box, which can help in identifying issues, but you know your software better than we do, so building custom monitors is an important practice. From the charts, you can navigate directly to the related logs by clicking the data points you need to investigate further.
Scope of Impact: Assess the extent—how many services, users, or systems are affected.
Looking at the stack traces and invocation flows of the relevant logs can help identify bottlenecks and affected services.
We also highly recommend using blocktorch's frontend Dragon SDK to get deeper insight into client-side issues.
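As a rough illustration of the scope-of-impact assessment above, the following sketch counts distinct affected users per service from a list of log records. The record fields (`service`, `user`, `status`) and the sample values are hypothetical and do not reflect blocktorch's actual log schema; adapt them to whatever your exported logs contain.

```python
from collections import defaultdict


def summarize_impact(records):
    """Count distinct affected users per service, looking only at failed calls."""
    users_by_service = defaultdict(set)
    for record in records:
        if record["status"] == "error":
            users_by_service[record["service"]].add(record["user"])
    # Convert the sets to counts so the result is easy to read or serialize.
    return {service: len(users) for service, users in users_by_service.items()}


# Hypothetical records; the field names are illustrative only.
sample_records = [
    {"service": "swap-api", "user": "0xaaa", "status": "error"},
    {"service": "swap-api", "user": "0xbbb", "status": "error"},
    {"service": "swap-api", "user": "0xaaa", "status": "error"},
    {"service": "price-feed", "user": "0xccc", "status": "ok"},
    {"service": "bridge", "user": "0xccc", "status": "error"},
]

print(summarize_impact(sample_records))
```

Counting distinct users rather than raw errors keeps one retrying client from inflating the apparent impact.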
2. Communication
Notify Stakeholders: Inform the relevant team members and stakeholders about the issue.
When your team members and stakeholders are part of your blocktorch project, they can receive monitor alerts proactively.
External Communication: If necessary, prepare a communication plan for customers or the public.
Home dashboards and search queries can also be shared with external stakeholders, so your community of users can stay informed as well.
3. Isolation
Isolate Affected Systems: To prevent further damage, isolate the compromised or malfunctioning components.
Limit Access: Restrict access to sensitive systems until the nature and scope of the issue are understood.
You can disable functionality in your UI.
If your smart contracts are built with the ability to pause functions, consider doing so.
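To make the pause mechanism above concrete, here is a minimal Python model of the idea: an owner-controlled switch that blocks a sensitive action while it is set. This is not contract code and not a blocktorch feature; the class and method names (`PausableVault`, `set_paused`, `withdraw`) are hypothetical, and a real contract would implement the equivalent guard on-chain.

```python
class PausedError(RuntimeError):
    """Raised when a guarded action is attempted while the system is paused."""


class PausableVault:
    """Toy model of a contract whose withdrawals can be paused by its owner."""

    def __init__(self, owner):
        self.owner = owner
        self.paused = False
        self.balances = {}

    def set_paused(self, caller, paused):
        # Only the owner may flip the switch, as with an owner-only pause function.
        if caller != self.owner:
            raise PermissionError("only the owner can pause or unpause")
        self.paused = paused

    def deposit(self, caller, amount):
        # Deposits stay open in this sketch; only withdrawals are guarded.
        self.balances[caller] = self.balances.get(caller, 0) + amount

    def withdraw(self, caller, amount):
        if self.paused:
            raise PausedError("withdrawals are paused")
        if self.balances.get(caller, 0) < amount:
            raise ValueError("insufficient balance")
        self.balances[caller] -= amount
        return amount


vault = PausableVault(owner="admin")
vault.deposit("alice", 10)
vault.set_paused("admin", True)   # isolate: block withdrawals during the incident
```

Guarding only the risky path (here, withdrawals) limits damage while keeping the rest of the system usable; which functions to guard is a design decision for your own contracts.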
4. Investigation
Review Logs: Check application, security, and system logs for anomalies or indicators of the cause by using blocktorch's search.
Identify Vulnerabilities: Look for any vulnerabilities or errors that might have led to the issue.
Your smart contract code can be accessed directly on blocktorch's contract details page.
Blocktorch's step debugger can help you find the exact line of code causing the vulnerability or error.
If your application uses oracles and you believe the root cause could lie there, you can check the oracle details that blocktorch provides out of the box.
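When reviewing logs as described above, it often helps to group failures by where and how they failed before stepping through any single one. The sketch below ranks the most frequent (contract, error) pairs. The log fields (`contract`, `error`) and sample values are hypothetical and stand in for whatever your search results contain.

```python
from collections import Counter


def top_failure_signatures(logs, limit=3):
    """Return the most frequent (contract, error) pairs among failed calls."""
    signatures = Counter(
        (log["contract"], log["error"])
        for log in logs
        if log.get("error")  # skip successful calls, which carry no error
    )
    return signatures.most_common(limit)


# Hypothetical log entries; the field names are illustrative only.
sample_logs = [
    {"contract": "Vault", "error": "insufficient allowance"},
    {"contract": "Vault", "error": "insufficient allowance"},
    {"contract": "Vault", "error": "insufficient allowance"},
    {"contract": "Router", "error": "deadline expired"},
    {"contract": "Router", "error": "deadline expired"},
    {"contract": "Vault", "error": None},
]

for (contract, error), count in top_failure_signatures(sample_logs):
    print(f"{count:>3}  {contract}: {error}")
```

The dominant signature is usually the best starting point for a debugging session, since it points at one contract and one failure mode.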
5. Mitigation
Patch and Update: Apply necessary patches or updates to software to mitigate the vulnerability or error.
We are aware that this can be especially hard when the root cause lies within a smart contract, unless your project uses an upgradable contract architecture.
Inform your users: If a security breach is confirmed, urge users not to sign any malicious smart contract interactions.
6. Recovery
Restore Services: Gradually restore services, ensuring they are fully sanitized and secure.
To test the services locally, you can leverage blocktorch's managed Hardhat forks.
7. Postmortem Analysis
Analyze Causes: Thoroughly document what happened, why it happened, and how it was resolved.
Review Processes: Evaluate and update security policies, response strategies, and monitoring techniques to prevent future incidents.
8. Ongoing Monitoring
Continuous Monitoring: Implement additional custom real-time monitoring in blocktorch to detect future issues promptly.
Regular Audits: Schedule regular security audits to ensure ongoing compliance and security.
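To illustrate the kind of rule a custom monitor might encode, the sketch below flags time buckets whose error rate exceeds a threshold. It is a standalone model, not blocktorch's monitor configuration; the bucket layout `(label, calls, errors)` and the default values for `threshold` and `min_calls` are assumptions chosen for the example.

```python
def error_rate_breaches(buckets, threshold=0.05, min_calls=20):
    """Return the labels of buckets whose error rate exceeds the threshold.

    Buckets with fewer than `min_calls` calls are skipped so that a handful
    of failures during a quiet period does not trigger an alert.
    """
    breaches = []
    for label, calls, errors in buckets:
        if calls < min_calls:
            continue
        if errors / calls > threshold:
            breaches.append(label)
    return breaches


# Hypothetical five-minute buckets: (label, total calls, failed calls).
sample_buckets = [
    ("12:00", 200, 2),
    ("12:05", 180, 27),
    ("12:10", 10, 5),
]

print(error_rate_breaches(sample_buckets))
```

Alerting on a rate with a minimum volume, rather than on a raw error count, keeps the rule meaningful as traffic grows or shrinks.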