Don: a passionate cybersecurity presales professional and an avid knowledge worker and coder.

Sentinel-Log-Ingestion-Resiliency


Security log ingestion from on-premises to Microsoft Sentinel.

Clients often ask us how to make security log ingestion survive intermittent or periodic network connectivity failures.

Read: 5mins, watch: 10mins, updated 4-Aug-24 9:13pm live.

Ingestion Survivability

This case study tests the resiliency features of the two native software agents purpose-built to ingest security logs from on-premises operating systems into Microsoft Sentinel, and assesses how robust their retry mechanisms are during temporary or prolonged network connectivity downtime.

This is a valid and crucial concern: no security logs should be lost when the network connection fails during ingestion. It serves as a guiding principle for a robust and resilient architectural design.

Overall Architecture

Overview of Sentinel log ingestion architecture.

Download original diagram in draw.io format here.

Greener lab

To keep things eco-friendly, the lab ran on a single nano box with a 6-month licensed Windows Server 2019 Standard hosting a Hyper-V virtualized Ubuntu Server 22.04.4 LTS. The setup had to stay up 24x7 for 3 to 7 days with minimal carbon emissions and minimal risk of a surging power bill at home ;)

Gears up

4.1 Windows Server 2019 Standard with Azure Connected Machine Agent
4.2 Ubuntu Server 22.04.4 with Azure Connected Machine Agent
4.3 Internet Connectivity
4.4 Microsoft Sentinel

  • Log Analytics Workspace
  • Syslog via AMA data connector
  • Data Collection Rule for Syslog for Azure Arc Machine
  • Data Collection Rule for Windows Event Log for Azure Arc Machine

Coverage

5.1 Windows Server 2019 Standard with Azure Connected Machine Agent suffering more than 9 hours of network connectivity downtime.

  • According to Microsoft’s official guideline, data is buffered for up to 8.5 hours, up to a maximum size of less than 1.5 GB, before being discarded. The test therefore used a downtime of more than 9 hours: can log retransmission still prevail?

5.2 Ubuntu Server 22.04.4 with Azure Connected Machine Agent suffering 3 days of network connectivity downtime.

  • Microsoft’s official guideline says how long the logs survive depends on the disk storage size. Let’s see whether that holds true.
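The two buffer limits quoted in 5.1 can be expressed as a small check. This is only a sketch of the arithmetic: the limits are the figures quoted in this article, the 1.5 GB value is assumed to mean decimal gigabytes, and `bytes_per_hour` is a hypothetical log volume you would have to measure yourself.

```python
# Limits as quoted in this article from Microsoft's guideline.
# They are assumptions here, not values read from the agent.
MAX_BUFFER_HOURS = 8.5
MAX_BUFFER_BYTES = 1.5 * 1000**3  # "less than 1.5 GB", assumed decimal GB


def buffer_outlasts_outage(outage_hours: float, bytes_per_hour: float) -> bool:
    """Return True only if both quoted limits hold for the whole outage."""
    within_time = outage_hours <= MAX_BUFFER_HOURS
    within_size = outage_hours * bytes_per_hour < MAX_BUFFER_BYTES
    return within_time and within_size


# A 9-hour outage breaks the time limit even at a tiny log volume.
print(buffer_outlasts_outage(9, 1_000_000))  # False
print(buffer_outlasts_outage(8, 1_000_000))  # True
```

By this reading, any outage longer than 8.5 hours falls outside the quoted buffer, which is why the Windows test deliberately went past 9 hours.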

Result

Phase 1 Test

Network downtime: 187 hours (7 days 19 hours)
Windows Server 2019: recovered the last 48 hours of Windows Events
Ubuntu Server 22.04.4: two Syslog records missing
Remarks (GMT+8):

  • Blocked software agent network connectivity on 10/Jul/24 3:30PM
  • Resumed software agent network connectivity on 18/Jul/24 9:35AM
  • Windows Event first resumed ingestion on 18/Jul/24 10:45AM
  • Syslog first resumed ingestion on 10/Jul/24 6:00PM

Phase 2 Test

Network downtime: 74 hours (3 days 2 hours)
Windows Server 2019: recovered the last 48 hours of Windows Events
Ubuntu Server 22.04.4: one Syslog record missing
Remarks (GMT+8):

  • Modified Event and Syslog message body with timestamp
  • Disconnected network cable on 19/Jul/24 12:48PM
  • Reconnected network cable on 22/Jul/24 2:58PM
  • Windows Event first resumed ingestion on 20/Jul/24 19:29:15
  • Syslog first resumed ingestion on Fri Jul 19 12:49:01 PM +08 2024

Phase 3 Test

Network downtime: 10 hours (10 hours 36 minutes)
Windows Server 2019: resumed from last ingested Event, 100% recovered
Ubuntu Server 22.04.4: resumed from last ingested Syslog, 100% recovered
Remarks (GMT+8):

  • Disconnected network cable on 22/Jul/24 9:55PM
  • Reconnected network cable on 23/Jul/24 08:31AM
  • Windows Event first resumed ingestion on 21/Jul/24 07:34:01AM - 100% recovered
  • Syslog first resumed ingestion on Mon Jul 22 09:55:01 PM +08 2024 - 100% recovered

Garage

Microsoft Sentinel Windows Server 2019 Ubuntu Server 22.04.4 Remark (GMT+8)

Windows Agent Install Script

Windows Scheduler Event Write Every 5mins

Hyper-V Linux Virtual Machine Installation

Sentinel last ingested Windows Event

Linux Agent Install Script

Windows Event

Linux Cronjob Hourly Write

Sentinel last ingested Linux Syslog

Azure Arc Machines Overview

Windows Azure Agent Installation

Linux Syslog

Last ingested Windows Event

Windows Data Collection Rule

Windows Wireshark Detecting Outbound Azure Agent Traffic

Linux Agent Installation and Device Login

Last ingested Linux Syslog

Linux Data Collection Rule

Windows Hosts File Prevent Outbound Azure Agent Traffic

Linux Host File Prevent Outbound Azure Agent Traffic
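For readers who want to reproduce the Linux side, a timestamped heartbeat record could be generated along the lines below. This is a hypothetical sketch, not the script used in this lab: the message text and function names are invented for illustration, and it relies on Python's standard `syslog` module, which is only available on Unix-like systems.

```python
import syslog
from datetime import datetime, timedelta, timezone

GMT8 = timezone(timedelta(hours=8))  # this article reports times in GMT+8


def heartbeat_message(now: datetime) -> str:
    """Build a message body that carries its own creation time."""
    stamp = now.isoformat(timespec="seconds")
    return f"sentinel-resiliency-test heartbeat at {stamp}"


def emit_heartbeat() -> None:
    """Send one heartbeat to the local syslog daemon (Unix only)."""
    syslog.syslog(syslog.LOG_INFO, heartbeat_message(datetime.now(GMT8)))
```

A cron entry that calls `emit_heartbeat()` on a fixed schedule gives every record a timestamp in its body, which makes gaps easy to spot after an outage.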


Sentinel Phase 2 Test

2.1 Last ingested Windows Event on 19/Jul/24 12:48:19

2.2 Last ingested Linux Syslog on Fri Jul 19 12:47:01 PM +08 2024

Sentinel Phase 3 Test

3.1 Last ingested Windows Event on 21/Jul/24 07:33:59

3.2 Last ingested Linux Syslog on Mon Jul 22 09:54:01 PM +08 2024


Conclusion

The three test phases, conducted over two weeks from 10 July to 23 July 2024, produced the pattern of results below.

8.1 The Windows Azure Connected Machine Agent can recover at most the last 24 hours of Windows Events once network connectivity resumes, and its retry counter/polling interval resets every 8.5 hours.

  • A manual restart of the Azure Connected Machine Agent resumes Windows Event reingestion without waiting for the retry counter/polling interval to reset and reattempt
  • The Phase 3 test result shows that log retransmission resumes from where it last stopped: events were fully reingested as long as their timestamps fell within the last 24 hours

8.2 The Linux Azure Connected Machine Agent can recover almost 99.99% of Syslog records (1 Syslog record went missing during the one-minute interval when the network connection failure hit; refer to the Phase 2 test result for details).

  • How long Syslog messages can be kept in the queue (for the whole period of network connectivity failure) depends on the Syslog server’s storage capacity. It’s recommended to run a sanity check for missing Syslog records, at least for the first one-minute interval after the network connection goes down, and to perform a manual reingestion if necessary
  • Although the Phase 3 test recovered 100% of Syslog records, it’s still recommended to check for missing Syslog records in the first minute of the network connectivity downtime
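The recommended sanity check can be sketched as a gap finder over record timestamps. It assumes one Syslog record per minute, which is my reading of the Phase 2 timestamps (last ingested at 12:47:01, first resumed at 12:49:01). In practice the `seen` list would come from an export of the Sentinel Syslog table. The function name is hypothetical.

```python
from datetime import datetime, timedelta


def missing_minutes(seen, start, end):
    """Return each minute in [start, end] that has no record in `seen`.

    Timestamps are truncated to the minute, so seconds do not need to match.
    """
    seen_minutes = {t.replace(second=0, microsecond=0) for t in seen}
    current = start.replace(second=0, microsecond=0)
    last = end.replace(second=0, microsecond=0)
    missing = []
    while current <= last:
        if current not in seen_minutes:
            missing.append(current)
        current += timedelta(minutes=1)
    return missing


# Phase 2: the record expected at 12:48 never arrived.
seen = [datetime(2024, 7, 19, 12, 47, 1), datetime(2024, 7, 19, 12, 49, 1)]
gaps = missing_minutes(seen, seen[0], seen[-1])
print(gaps)  # [datetime.datetime(2024, 7, 19, 12, 48)]
```

Any minute the check reports is a candidate for manual reingestion from the Syslog server.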

All in all, it’s strongly recommended to build redundancy and resilience into the network connectivity between on-premises and Sentinel. However, in the event of all failures (touch wood), we are prepared: we know which symptoms to look out for and which actions to take to achieve a 100% log reingestion recovery.
