Sentinel-Log-Ingestion-Resiliency
Security logs ingestion from on-premises to Microsoft Sentinel.
Summary
Often time, we are hearing from our client on topic vis-a-vis know-how addressing security log ingestion capable of surviving intermittent or periodic network connectivity failure.
Read: 5mins, watch: 10mins, updated 4-Aug-24 9:13pm live.
Ingestion Survivability
This case study focuses on the resiliency features
testing of the two native software agents purpose-built to handle the security logs ingestion from on-premises operating systems to Microsoft Sentinel, which attesting its robustness of retry mechanism
during temporary or prolong network connectivity downtime.
In general, this is a valid and crucial concern ensuring there is no loss of security logs in the event of network connection failure during the ingestion. It serves as a guiding principle for a robust and resilient architectural design.
Overall Architecture
Overview of Sentinel log ingestion architecture.
Download original diagram in draw.io format here.
Greener lab
An eco-friendly way, a nano box was powered-on alongsides 6-month licensed Windows Server 2019 Standard with Hyper-V virtualized Ubuntu Server 22.04.4 LTS up and running. This test aimed to survive target 3 to 7 days long 24x7 uptimes with minimum carbon emisson and minimize risk of surging power utility bill at home ;)
Gears up
4.1 Windows Server 2019 Standard with Azure Connected Machine Agent
4.2 Ubuntu Server 22.04.4 with Azure Connected Machine Agent
4.3 Internet Connectivity
4.4 Microsoft Sentinel
- Log Analytics Workspace
- Syslog via AMA data connector
- Data Collection Rule for Syslog for Azure Arc Machine
- Data Collection Rule for Windows Event Log for Azure Arc Machine
Coverage
5.1 Windows Server 2019 Standard with Azure Connected Machine Agent suffering more than 9 hours network connectivity downtime.
- According to Microsoft official guideline, data is buffered for up to 8.5 hours and maximum size of less than 1.5 GB before being discarded, hence, more than 9 hours downtime was identified and tested, can the logs re-transmission prevail?
5.2 Ubuntu Server 22.04.4 with Azure Connected Machine Agent suffering 3 days network connectivity downtime.
- Let’s see how long the logs survive depend on the disk storage size as per Microsoft official guideline and let’s prove it to be trued.
Result
Network Downtime Hour | Windows Server 2019 | Ubuntu Server 22.04.4 | Remark (GMT+8) |
---|---|---|---|
Phase 1 Test 187 (7 days 19 hours) |
Recovered last 48 hours Windows Event |
Missing two Syslog record |
|
Phase 2 Test 74 (3 days 2 hours) |
Recovered last 48 hours Windows Event |
Missing one Syslog record |
|
Phase 3 Test 10 (10 hours 36 minutes) |
Resumed from last ingested Event, 100% recovered |
Resumed from last ingested Syslog, 100% recovered |
|
Garage
Microsoft Sentinel | Windows Server 2019 | Ubuntu Server 22.04.4 | Remark (GMT+8) | |
---|---|---|---|---|
Windows Agent Install Script |
Windows Scheduler Event Write Every 5mins |
Hyper-V Linux Virtual Machine Installation |
Sentinel last ingested Windows Event |
|
Linux Agent Install Script |
Windows Event |
Linux Cronjob Hourly Write |
Sentinel last ingested Linux Syslog |
|
Azure Arc Machines Overview |
Windows Azure Agent Installation |
Linux Syslog |
Last ingested Windows Event |
|
Windows Data Collection Rule |
Windows Wireshark Detecting Outbound Azure Agent Traffic |
Linux Agent Installation and Device Login |
Last ingested Linux Syslog |
|
Linux Data Collection Rule |
Windows Hosts File Prevent Outbound Azure Agent Traffic |
Linux Host File Prevent Outbound Azure Agent Traffic |
|
Conclusion
The three phases of tests spread across two weeks, which conducted from 10 July to 23 July in 2024 produced pattern of results below.
8.1 Windows Azure Connected Machine Agent is capable to recover a maximum of last 24 hours Windows Event once network connectivity resumed and retry counter/polling interval reset every 8.5 hours.
- A manual restart of Azure Connected Machine Agent is able to resume Windows Event reingestion rather than waiting for the retry counter/polling interval reset and reattempt
- It has proven works and managed to resumed log retransmission from where it last stopped as per Phase 3 test result, which fully reingested as long as the Windows Event timestamp falls between last 24 hours
8.2 Linux Azure Connected Machine Agent is capable to recover almost 99.99% Syslog records (with 1 missing Syslog record during the one minute interval when network connection failure hits - refer to Phase 2 test result for detail).
- How long the Syslog messages can be kept in queue (for the whole period of network connectivity failure) depends on Syslog server’s storage capacity, it’s recommended to perform a sanity check for any missing Syslogs record(s) at least for the first one minute internal when network connection downtime hits and perform a manual reingestion if necessary
- Despite Phase 3 test result proven and recovered 100% of all Syslog records, it’s still recommended to check for any missing Syslog record(s) during the network connectivity downtime for the first one minute
All in all, it’s strongly recommended to have the redundance and resilience capabilities in the network connectivity implementation between on-premises and Sentinel. However, in the event of all failures (touch wood), we are prepared and aware what to lookout for symptoms and action to achieve a 100% log reingestion recovery plan.