
This article is a postmortem.
Not a tutorial.
Not a success story.
Just an honest write-up of how soc-core02 died, why it happened, and what I learned from it.
I’m writing this mostly for myself, but also for anyone building labs at home and thinking:
“It’s just a lab, what could go wrong?”
Well… quite a lot 😅
The Context
soc-core02 was my SOC core VM.
Ubuntu Server. Hardened. Firewall rules everywhere. SSH keys. Log ingestion.
It was supposed to be the stable brain of Operation Iron Watch 02.
I had already:
- Deployed a SIEM stack
- Forwarded logs from web-arm01
- Started building baselines
- Documented everything carefully
At least… that’s what I thought.
The Trigger
The mistake started with overconfidence.
I wanted to “clean things up” and reinstall / overwrite parts of the SIEM stack (Wazuh components, indexer, dashboard).
Same VM. Same disk. Same system.
I didn’t:
- Snapshot the VM
- Fully uninstall previous components
- Stop and validate every dependent service first
I assumed:
“I’ve done it once, I know what I’m doing now.”
Classic.
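In hindsight, the three steps I skipped could have been a hard gate instead of a mental checklist. Here is a minimal sketch of that idea in Python. Everything in it is hypothetical: the names PreflightCheck, failed_checks and guarded_change are mine, and the lambdas are stand-ins for real queries to the hypervisor, the package manager, or the service manager. It is not what I ran on soc-core02, just the shape of what I should have had.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreflightCheck:
    """One condition that must hold before the change is allowed."""
    name: str
    passed: Callable[[], bool]


def failed_checks(checks: List[PreflightCheck]) -> List[str]:
    """Return the names of every check that did not pass."""
    return [check.name for check in checks if not check.passed()]


def guarded_change(checks: List[PreflightCheck], change: Callable[[], None]) -> None:
    """Run `change` only if every preflight check passes."""
    failures = failed_checks(checks)
    if failures:
        raise RuntimeError("Refusing to touch the system: " + ", ".join(failures))
    change()


# Hypothetical state for illustration only; in a real lab each lambda
# would query the hypervisor, the package manager, or the service manager.
checks = [
    PreflightCheck("snapshot taken", lambda: False),
    PreflightCheck("old components fully uninstalled", lambda: True),
    PreflightCheck("dependent services stopped and validated", lambda: True),
]

print(failed_checks(checks))
```

With the snapshot check failing, guarded_change raises before the reinstall ever starts, which is exactly the friction I was missing.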
The Symptoms
At first, it looked recoverable.
Then things escalated quickly:
- Endless logs flooding the terminal
- Services failing silently
- SSH becoming unstable
- System becoming unresponsive
- Eventually… total lock-up
I couldn’t even type commands anymore.
The system was alive, but unusable.
No graceful shutdown.
No clean recovery.
The Root Cause (Plainly Said)
I overwrote a complex SIEM stack on a hardened server without a rollback plan.
More specifically:
- Conflicting services (manager, indexer, dashboard)
- Disk and service state corruption
- Resource contention
- No clean separation between “experiment” and “production lab”
This wasn’t a Wazuh problem.
This wasn’t an Ubuntu problem.
This was an architectural mistake.
The Hard Truth
At some point, you have to stop troubleshooting and be honest:
“This system cannot be trusted anymore.”
Maybe I could have forced a recovery.
But any detection results after that would be questionable.
So I did the hardest but cleanest thing:
I killed soc-core02.
No patching.
No duct tape.
No pretending.
The Decision: soc-core03
I wiped everything and started again with a clear rule:
One SOC core = one purpose = one clean lifecycle.
Thus was born soc-core03.
With clear lessons applied:
- Clean install, no leftovers
- Minimal base system
- Incremental build
- Validation at every step
- No overwriting critical components mid-operation
And most importantly:
- Architecture before tools
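"Incremental build" and "validation at every step" are easy to say and easy to skip, so here is a small sketch of what they mean as a loop. It assumes each component can be expressed as an install action plus a validation check. The helper names (build_incrementally, make_step) and the in-memory installed list are made up for illustration, and the component order is only an example, not the actual soc-core03 build procedure.

```python
from typing import Callable, List, Tuple

# (step name, install action, validation check)
Step = Tuple[str, Callable[[], None], Callable[[], bool]]


def build_incrementally(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first one that fails validation."""
    completed: List[str] = []
    for name, install, validate in steps:
        install()
        if not validate():
            raise RuntimeError(
                f"Validation failed after step '{name}'; completed so far: {completed}"
            )
        completed.append(name)
    return completed


# Hypothetical stand-in for real system state.
installed: List[str] = []


def make_step(name: str) -> Step:
    """Build a toy step that 'installs' by recording the name, then checks it."""
    return (name, lambda: installed.append(name), lambda: name in installed)


steps = [make_step(n) for n in ["base system", "indexer", "manager", "dashboard"]]
print(build_incrementally(steps))
```

The point of the design is that a failed validation stops the build right there, so the next component is never layered on top of something already broken.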
What This Failure Gave Me
Oddly enough, this failure was valuable.
It taught me:
- Why change management matters (even in labs)
- Why SOC stability is sacred
- Why “just testing” can still destroy systems
- Why documentation isn’t optional
- Why real SOC work is as much discipline as skill
This is exactly the kind of mistake you want to make before you touch production, not after.
Why I’m Publishing This
I could have hidden this.
I could have pretended soc-core02 never existed.
But Cyberlandji is about learning in public, not showcasing perfection.
If you’re building labs and something breaks badly:
- You’re not stupid
- You’re not behind
- You’re learning the real lessons
SOC work is not clean.
Detection engineering is painful.
And architecture mistakes hurt — but they teach fast.
Closing Thoughts
soc-core02 failed so soc-core03 could be better.
Iron Watch 02 was paused.
The narrative was frozen.
And the rebuild was done properly.
If this post saves one person from overwriting a live SIEM without a snapshot, it did its job.
Back to building.
Back to learning.
Back to Iron Watch.
— Cyberlandji
