When a Linux server slows down, the problem is usually not one thing. It is CPU pressure, memory churn, disk space, a dead service, or a log file quietly filling with errors while nobody is looking. That is why system health checks matter, and why scripting them with Linux Bash is one of the most practical forms of automation an administrator can build.
This post shows how to automate Linux health checks with Bash scripts in a way that is lightweight, portable, and useful in real operations. You will see what to monitor, how to structure the script, how to report problems clearly, and how to run checks on a schedule without creating noise. The goal is simple: better visibility, faster troubleshooting, and fewer surprises.
Why Automate Linux Health Checks
Manual checks work when you have one or two servers and plenty of time. They break down fast when you are responsible for multiple environments, especially if each system has a different role. One admin might check disk space on Monday, memory on Tuesday, and services only after a user complains. That kind of inconsistency is exactly how outages sneak in.
Automation fixes that by making the same checks run the same way every time. It also helps you catch drift early. For example, a server can stay technically online while disk usage climbs from 78% to 93% over a few days. A scheduled Bash script will flag that trend before logins fail or applications start throwing write errors.
Consistent checks are more valuable than occasional deep dives. Most Linux failures give warning signs first. A script that runs on a schedule is far better at catching those signs than a human who checks only after an alert or incident.
This is especially useful in small teams. One administrator can manage web servers, database hosts, and internal utilities without manually logging into each box. The script becomes a repeatable standard across systems, which is exactly the kind of disciplined operating practice reflected in frameworks like NIST Cybersecurity Framework and the workload expectations described by the U.S. Bureau of Labor Statistics for network and systems administrators.
For a security-minded admin, health checks also fit well with the defensive mindset taught in the CEH v13 course. If you know what normal looks like, it is much easier to spot misconfigurations, suspicious services, and abnormal log activity before they become a problem.
Core Health Metrics To Monitor
A useful Linux health check script does not need to inspect everything. It needs to inspect the right things for the server role. A web server, for example, cares about CPU load, memory pressure, nginx availability, disk space, and network reachability. A database host may care more about swap activity, filesystem growth, and storage latency patterns.
CPU, load, and memory pressure
CPU usage and load average are often the first indicators of trouble. High load does not always mean high CPU use. It can also mean processes are stuck waiting on disk or blocked on another resource. That is why a load average should be interpreted against the number of CPU cores, not in isolation.
Memory usage is just as important. Linux uses free memory aggressively for cache, so the available field from free -h is more useful than raw free memory. Heavy swap activity usually means the system is under memory pressure. That can slow applications down dramatically and may lead to instability if it continues.
- CPU: sustained high load, runaway processes, or stuck tasks
- Memory: low available RAM, aggressive caching shifts, swap usage
- Disk: high space use, inode exhaustion, log growth
- Services: failed daemons such as sshd, nginx, cron, or databases
- Network: interface failures, packet loss, DNS failures, endpoint reachability
- Logs: repeated errors, denied access, authentication failures, kernel warnings
These checks align well with operational guidance found in vendor documentation such as Microsoft Learn and official Linux references like Red Hat Linux resources. The specific commands vary by distro, but the operational logic stays the same.
Planning A Bash-Based Health Check Script
The best health check script is focused. Do not try to build a universal monitor for every possible host on day one. A file server, a reverse proxy, and a database machine have different failure modes. Define the scope first, then decide what counts as healthy for that role.
Start by deciding what the script will do with its output. A quick terminal report is useful during testing, but production checks usually need one or more of these:
- Write to a log file for history and troubleshooting
- Print summary output for cron mail or manual review
- Send alerts when thresholds are exceeded
Next, choose the commands you will rely on. For portability, favor tools found on most Linux distributions:
- uptime or /proc/loadavg for load and uptime
- free -h or /proc/meminfo for memory
- df -h and df -i for storage and inode use
- systemctl for service status on systemd hosts
- ss for listening ports and active connections
- ping for network reachability
- journalctl for recent log review
Pro Tip
Build thresholds into variables, not hard-coded values. A database server and a small utility VM rarely need the same warning levels for disk, memory, or load.
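One way to apply that tip is to give every threshold a default that the environment or a per-host config file can override. The sketch below is a minimal illustration, not a required layout: the config file argument and the MIN_AVAIL_MB name are assumptions made for this example, while WARN_DISK and CRIT_DISK match the script skeleton in this article.

```shell
# Load thresholds: defaults first, then optional per-host overrides.
# The config file is plain shell assignments, for example: WARN_DISK=70
load_thresholds() {
    local config_file="${1:-}"

    # Defaults apply only when the variable is not already set.
    WARN_DISK="${WARN_DISK:-80}"
    CRIT_DISK="${CRIT_DISK:-90}"
    MIN_AVAIL_MB="${MIN_AVAIL_MB:-512}"   # hypothetical name for this sketch

    # A readable config file overrides the defaults for this host.
    if [ -n "$config_file" ] && [ -r "$config_file" ]; then
        . "$config_file"
    fi
}
```

Because the file is sourced as shell, it should be writable only by root or the account that owns the script.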
This kind of structured operations thinking also matches the kind of system administration discipline reflected in Linux Foundation materials and common enterprise hardening guidance such as the CIS Benchmarks. Those benchmarks are not health checks themselves, but they reinforce the same principle: define a baseline and automate verification against it.
Setting Up The Script Structure
A Bash health check script should be easy to read six months later, not just functional today. Start with a shebang and safety settings that reduce silent failures. In many scripts, set -euo pipefail is a good baseline, though you should understand how each option behaves before using it everywhere.
Then divide the script into functions. That gives you reusable blocks, clearer error handling, and easier testing. A simple structure might include check_cpu, check_memory, check_disk, check_services, and check_network. Put thresholds and service names in variables near the top so changes are simple and visible.
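#!/usr/bin/env bash
# Stop on errors, unset variables, and failed pipeline stages.
set -euo pipefail

# Settings live at the top so they are easy to find and change.
HOSTNAME_SHORT="$(hostname -s)"
WARN_DISK=80
CRIT_DISK=90
SERVICES=("sshd" "nginx" "cron")

# Placeholder check: prints disk usage without applying thresholds yet.
check_disk() {
    df -h
}

main() {
    check_disk
}

main "$@"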
Add a header comment block that explains purpose, usage, and permissions. If the script reads logs or checks system services, it may need elevated rights. Store the final version somewhere standard like /usr/local/bin, then make it executable with chmod +x.
Official shell behavior and command syntax references are easy to verify through the GNU Bash Manual. For service management on systemd-based hosts, the systemctl documentation is the authority.
Checking CPU Load And Uptime
Load average tells you how many tasks are running or waiting to run; on Linux it also counts tasks blocked in uninterruptible I/O. The uptime command is a quick way to see load averages for the past 1, 5, and 15 minutes. That trend matters. A single spike is not the same thing as a sustained problem.
Example logic is straightforward: compare the 1-minute load to CPU core count. On a 4-core machine, a load average around 4 can be normal under steady use. A load of 12 for several minutes is a different story. That may indicate a runaway process, storage bottleneck, or some other resource wait that is making tasks pile up.
A script can also use /proc/loadavg for a very lightweight read. If the value crosses your threshold, print a clear warning and include context such as recent process activity. Commands like ps -eo pid,ppid,cmd,%cpu,%mem --sort=-%cpu | head help identify the top consumers.
- Short spike: may be normal during backup jobs or batch processing
- Sustained high load: usually indicates a real operational issue
- High load with low CPU usage: often points to I/O wait or blocked processes
That distinction matters. The value is not in just reporting load; it is in helping the admin decide whether to investigate further. A report that states the load, the core count, and the top consumers shortens the time it takes to respond.
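The load-versus-cores comparison can be sketched as follows. This is an illustration under stated assumptions: the function names and the per-core thresholds (1.5 for warning, 3.0 for critical) are choices made for this example, picked so that a load of 4 on 4 cores reads as OK and a load of 12 reads as critical, in line with the figures above.

```shell
# Classify a 1-minute load average against the CPU core count.
# Usage: classify_load <load1> <cores> [warn_per_core] [crit_per_core]
classify_load() {
    local load1="$1" cores="$2" warn="${3:-1.5}" crit="${4:-3.0}"
    # The shell has no floating-point math, so awk does the comparison.
    awk -v l="$load1" -v c="$cores" -v w="$warn" -v k="$crit" 'BEGIN {
        per_core = l / c
        if (per_core >= k)      print "CRITICAL"
        else if (per_core >= w) print "WARN"
        else                    print "OK"
    }'
}

check_cpu() {
    local load1 rest cores status
    read -r load1 rest < /proc/loadavg   # first field is the 1-minute load
    cores="$(nproc)"
    status="$(classify_load "$load1" "$cores")"
    echo "$status: load1=$load1 cores=$cores"
    if [ "$status" != "OK" ]; then
        # Show the top CPU consumers for context.
        ps -eo pid,ppid,cmd,%cpu,%mem --sort=-%cpu | head -n 6
    fi
}
```

Keeping the comparison in its own function means the thresholds can be tested with fixed numbers, without having to generate real load.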
Checking Memory And Swap Usage
Memory checks are one of the most useful parts of automation because memory problems often build slowly. The free -h command gives a readable snapshot of total, used, free, shared, buff/cache, and available memory. The available value is the one administrators usually care about most because it estimates how much RAM can still be used without pressure.
Swap usage is the other key signal. Some systems use a little swap without any real problem. The warning sign is heavy, ongoing swapping paired with low available memory. That means the kernel is moving pages in and out of RAM constantly, which can drag response time down across the entire host.
A practical check can look like this:
- Read memory totals from free -h
- Compare available memory to a warning threshold
- Check whether swap used is above a set limit
- If thresholds are exceeded, print the top memory consumers
For context, a useful follow-up command is:
ps -eo pid,comm,%mem --sort=-%mem | head -n 10
That shows which processes are consuming the most memory at the moment. For deeper analysis, some environments use smem, though it may not be installed everywhere. The official Linux memory documentation in kernel resources and distribution docs is a better source than guessing based on a single command output.
Note
Do not treat zero free memory as failure on Linux. The kernel uses RAM for caching on purpose. Focus on available memory and swap pressure instead.
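A minimal sketch of that check reads /proc/meminfo directly, which is easier to parse than the human-readable free -h output. The function names and default thresholds (512 MB available, 25% swap used) are assumptions for illustration, and the code assumes a kernel recent enough to expose the MemAvailable field.

```shell
# Read one field (in kB) from meminfo-formatted text.
# Usage: meminfo_kb <field> [file]
meminfo_kb() {
    awk -v key="$1:" '$1 == key { print $2; exit }' "${2:-/proc/meminfo}"
}

# Usage: check_memory [file] [min_avail_mb] [max_swap_pct]
check_memory() {
    local file="${1:-/proc/meminfo}" min_avail_mb="${2:-512}" max_swap_pct="${3:-25}"
    local avail_kb swap_total_kb swap_free_kb avail_mb swap_pct status="OK"

    avail_kb="$(meminfo_kb MemAvailable "$file")"
    swap_total_kb="$(meminfo_kb SwapTotal "$file")"
    swap_free_kb="$(meminfo_kb SwapFree "$file")"

    avail_mb=$(( avail_kb / 1024 ))
    swap_pct=0
    # Hosts without swap report SwapTotal as 0; avoid dividing by zero.
    if [ "$swap_total_kb" -gt 0 ]; then
        swap_pct=$(( (swap_total_kb - swap_free_kb) * 100 / swap_total_kb ))
    fi

    # Warn on low available memory or on swap use above the limit.
    if [ "$avail_mb" -lt "$min_avail_mb" ] || [ "$swap_pct" -gt "$max_swap_pct" ]; then
        status="WARN"
    fi
    echo "$status: available=${avail_mb}MB swap_used=${swap_pct}%"
}
```

When the result is WARN, the calling code can follow up with the ps command shown earlier to list the top memory consumers.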
Checking Disk Space And Filesystem Health
Disk issues are some of the easiest problems to prevent and some of the most annoying to recover from. A filesystem can fail even before it reaches 100% capacity if logs grow too fast, temp space fills up, or inodes run out. That is why df -h is necessary but not enough on its own.
Use df -i to check inode usage. Inode exhaustion happens when the filesystem still has space but has run out of file entries. This is common in directories that store millions of small files, cache data, or temporary application artifacts. A server can look healthy by space percentage and still fail to create new files.
High-risk paths usually include:
- /var/log for runaway log files
- /tmp and /var/tmp for temporary clutter
- Application data directories for uploads, queues, or cache
- Database volumes for transaction growth
- Mounted network filesystems if availability depends on remote storage
A simple Bash check can parse df output and warn at defined thresholds, such as 80% for warning and 90% for critical. That gives you room to act before the filesystem becomes full. For shared storage, also verify the mount itself is present. A mounted path that silently disappears is a different kind of outage than a nearly full disk.
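The threshold check described above might look like the sketch below. The function name is hypothetical, and the code assumes POSIX-format output (df -P), which keeps each filesystem on one line, and mount points without spaces.

```shell
# Classify each filesystem in df-style output read from stdin.
# Works for `df -P` (space) and `df -Pi` (inodes): both put the
# use percentage in column 5 and the mount point in column 6.
# Usage: df -P | classify_df [warn_pct] [crit_pct]
classify_df() {
    local warn="${1:-80}" crit="${2:-90}"
    awk -v w="$warn" -v c="$crit" 'NR > 1 {
        pct = $5
        sub(/%$/, "", pct)
        if (pct !~ /^[0-9]+$/) next    # skip "-" where no percentage is reported
        status = "OK"
        if (pct + 0 >= c)      status = "CRITICAL"
        else if (pct + 0 >= w) status = "WARN"
        print status ": " $6 " at " pct "%"
    }'
}
```

Typical calls would be `df -P | classify_df "$WARN_DISK" "$CRIT_DISK"` for space and `df -Pi | classify_df 80 90` for inodes. For shared storage, `mountpoint -q /path` can confirm the mount is present before its usage is checked.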
For broader storage guidance, vendor and standards sources such as Red Hat documentation and CIS provide operational hardening advice that pairs well with routine health checks.
Verifying Critical Services
A Linux host is only useful if its required services are running. A web server without nginx, a remote admin box without sshd, or a scheduled job host with cron stopped is not healthy, even if the machine is powered on. Service checks are one of the fastest ways to catch hidden failure.
On systemd systems, systemctl is-active is the cleanest way to check status. You can maintain a configurable list of services based on role. For example, a web server might check sshd, nginx, and cron, while a database host checks the database daemon plus time sync and backup-related services.
for svc in "${SERVICES[@]}"; do
    if systemctl is-active --quiet "$svc"; then
        echo "OK: $svc is running"
    else
        echo "CRITICAL: $svc is not active"
        systemctl status "$svc" --no-pager -l
    fi
done
If your environment includes older distributions or minimal installations, you may need compatibility with service or init scripts. That is why portability planning matters. Health checks should not fail just because the host uses a different init system.
When a service is inactive, capture failure details for later review. The most helpful reports do not just say “failed”; they include unit status, timestamp, and recent log lines. That gives the admin a place to start instead of making them re-run commands manually.
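As a rough sketch of that idea, a per-service function can attach a timestamp and the unit's most recent journal lines to a failure report. The function name and the default of 10 lines are assumptions for this example, and it assumes a systemd host where journalctl is available.

```shell
# Report one service; on failure, append recent journal lines for the unit.
# Usage: check_service <unit> [log_lines]
check_service() {
    local svc="$1" lines="${2:-10}"
    if systemctl is-active --quiet "$svc"; then
        echo "OK: $svc is running"
        return 0
    fi
    echo "CRITICAL: $svc is not active (checked $(date '+%Y-%m-%d %H:%M:%S'))"
    # Recent log lines give the admin a starting point. Errors are ignored
    # so an unreadable journal does not abort the whole health check.
    journalctl -u "$svc" -n "$lines" --no-pager 2>/dev/null || true
    return 1
}
```

In a script that runs under set -e, call it as `check_service nginx || failed=1` so that one inactive service does not stop the remaining checks.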
For service management details, the official references are the systemctl manual and distribution docs from vendors such as Red Hat.
Reviewing Logs For Warning Signs
Logs tell you what the system was trying to say before something failed. A good health check script does not dump entire logs into a report. It scans for the high-value signs: error, failed, critical, panic, denied, authentication failures, and repeated warnings in a recent time window.
On systemd hosts, journalctl is the cleanest option. You can limit the scan to recent entries and search for important keywords. That keeps the output useful instead of noisy. On systems that still rely on log files in /var/log, grep-based checks can work just as well.
journalctl --since "15 minutes ago" --no-pager | grep -Ei "error|failed|critical|panic|denied"
Authentication logs deserve extra attention. Repeated invalid logins, password failures, and denied SSH attempts can be early signs of brute force activity or a misconfigured service. That does not always mean an attack is underway, but it does mean someone should review it.
Logs are signal, not storage. A useful health check highlights a pattern or anomaly. It does not flood the reader with every line written by the system.
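One way to report a pattern instead of raw lines is to count matches per keyword. The sketch below is illustrative only: the function name is made up for this example, and the keyword list is the one used in this article.

```shell
# Count lines matching each keyword in log text read from stdin.
# Prints only keywords that matched, so a quiet system produces no output.
summarize_log_matches() {
    awk 'BEGIN { n = split("error failed critical panic denied", kw, " ") }
    {
        line = tolower($0)
        for (i = 1; i <= n; i++)
            if (index(line, kw[i]) > 0) count[kw[i]]++
    }
    END {
        for (i = 1; i <= n; i++)
            if (count[kw[i]] > 0) print kw[i] ": " count[kw[i]]
    }'
}
```

It can be fed from the journal, for example `journalctl --since "15 minutes ago" --no-pager | summarize_log_matches`, or from a file under /var/log on hosts without systemd.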
For log management and event handling, official guidance from CISA and NIST is a good reference point, especially when your environment is subject to audit or incident response requirements.
Adding Network Connectivity Checks
Network checks help separate a local server problem from an upstream dependency problem. A host can look fine internally while DNS fails, the default gateway is unreachable, or a critical API endpoint is down. That is why a health script should include at least one reachability test and one name-resolution test.
Use ping carefully. It is useful for basic reachability, but it should not be the only network check. Add DNS validation with getent hosts, nslookup, or dig if available. Then check listening ports or active connections with ss to make sure required services are actually bound.
- Ping gateway to verify local network reachability
- Resolve DNS names to confirm name service works
- Check listening ports for required daemons
- Test upstream dependencies such as database or API endpoints
If ping fails but local services are healthy, the issue may be routing or upstream connectivity. If DNS fails but ping works, the issue is likely name service. That distinction matters because it narrows troubleshooting quickly.
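That decision logic can be sketched as a small function that takes the two probe results and names the likely area. The function names and message wording are assumptions for this example, and the note about filtered ICMP is one possible explanation, not a certainty.

```shell
# Turn two probe results into a short diagnosis.
# Arguments are exit codes: 0 means the probe succeeded.
diagnose_network() {
    local ping_rc="$1" dns_rc="$2"
    if [ "$ping_rc" -eq 0 ] && [ "$dns_rc" -eq 0 ]; then
        echo "OK: gateway reachable, DNS resolving"
    elif [ "$ping_rc" -eq 0 ]; then
        echo "WARN: gateway reachable but DNS failing (check name service)"
    elif [ "$dns_rc" -eq 0 ]; then
        echo "WARN: DNS resolving but gateway ping failing (ICMP may be filtered)"
    else
        echo "CRITICAL: gateway unreachable and DNS failing (check routing or upstream)"
    fi
}

# Usage: check_network <gateway_ip> <hostname_to_resolve>
check_network() {
    local gateway="$1" name="$2" ping_rc=0 dns_rc=0
    ping -c 2 -W 2 "$gateway" >/dev/null 2>&1 || ping_rc=$?
    getent hosts "$name"      >/dev/null 2>&1 || dns_rc=$?
    diagnose_network "$ping_rc" "$dns_rc"
}
```

On most hosts the default gateway can be read with `ip route show default | awk '/default/ {print $3; exit}'` and passed in as the first argument.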
For network diagnostics, official documentation from vendors like Cisco® and standards bodies like IETF is more reliable than guesswork. In practice, that means your Bash script should report what failed and where, not just “network problem.”
Building Useful Output And Reporting
The output of a health check script should be easy to skim. Busy admins do not need prose. They need timestamps, hostnames, severity, and a short description of what failed. A consistent format makes it easier to search logs, compare hosts, and feed results into alerting tools.
Use labels such as OK, WARN, and CRITICAL. Include the hostname in every line so output from multiple machines can be correlated later. Save the results to a dated log file, such as one per day or one per host per day, depending on scale.
A practical summary might look like this:
- Timestamp: when the check ran
- Host: which machine was checked
- Severity: OK, WARN, or CRITICAL
- Component: CPU, disk, service, log, or network
- Details: short explanation and next action
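The fields above fit on a single line per finding. A minimal sketch of such a formatter follows; the function name and field order are assumptions for this example, and HOSTNAME_SHORT is the variable from the script skeleton, with hostname as a fallback.

```shell
# Print one report line: timestamp, host, severity, component, details.
# Usage: report <severity> <component> <details>
report() {
    local severity="$1" component="$2" details="$3"
    printf '%s %s %s %s: %s\n' \
        "$(date '+%Y-%m-%dT%H:%M:%S%z')" \
        "${HOSTNAME_SHORT:-$(hostname)}" \
        "$severity" "$component" "$details"
}
```

A call such as `report WARN disk "/var at 91%"` produces a line that is easy to grep by host, severity, or component. Appending the output to a file named with `$(date +%F)` gives one log per day.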
If a threshold is exceeded, you can send the results by email, Slack, or another alert path already approved in your environment. Just keep the message concise. A warning that says “disk on /var at 91%, nginx active, memory available 1.2 GB” is useful. A wall of raw command output is not.
Key Takeaway
Good reporting is part of the health check. If the result cannot be scanned in seconds, the script is not done yet.
For incident handling and alerting practices, security operations teams often align with frameworks described by SANS Institute and NIST. Those references help justify consistent reporting standards across Linux systems.
Scheduling Health Checks With Cron Or Systemd Timers
Scheduling is where Bash health checks become truly valuable. Cron is still the simplest option on many systems. It is easy to set up, runs reliably, and works well for checks that do not need complex dependency handling. A check can run every 5 minutes for critical systems or every 15 minutes for less sensitive hosts.
The more important the system, the shorter the interval should be. Just make sure the script finishes before the next run starts. If it might not, add a lock file or use flock so runs do not overlap.
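A simple lock can be sketched with mkdir, which is atomic: only one process can create the directory. This is a swapped-in technique chosen because it needs no extra tools; the lock path and function names are assumptions for this example.

```shell
LOCK_DIR="${LOCK_DIR:-/tmp/linux-health-check.lock}"   # hypothetical path

acquire_lock() { mkdir "$LOCK_DIR" 2>/dev/null; }
release_lock() { rmdir "$LOCK_DIR" 2>/dev/null; }

# Run a command only if no other run holds the lock.
# Usage: run_locked <command> [args...]
run_locked() {
    local rc=0
    if ! acquire_lock; then
        echo "WARN: previous run still active, skipping" >&2
        return 0
    fi
    "$@" || rc=$?
    release_lock
    return "$rc"
}
```

One limitation: if the script is killed before release_lock runs, the directory stays behind and blocks later runs until it is removed. Where flock from util-linux is installed, wrapping the cron command as `flock -n /var/lock/linux-health.lock /usr/local/bin/linux-health-check.sh` avoids that, because the kernel drops the lock when the process exits.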
systemd timers are a stronger option on modern Linux distributions. They offer cleaner logging, better dependency support, and tighter integration with service management. If your environment already uses systemd extensively, timers often make maintenance easier than cron.
*/5 * * * * /usr/local/bin/linux-health-check.sh >> /var/log/linux-health.log 2>&1
Whether you use cron or timers, redirect output to a file or configure mail notifications for unattended runs. The point is not just execution. The point is making sure someone sees the result when a threshold is crossed.
Official cron behavior is documented in traditional Unix and Linux references, while systemd timer behavior is covered in the systemd.timer manual.
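For comparison with the cron line, a systemd timer needs two small unit files. The sketch below writes them from a shell function; the unit names are assumptions for this example, and the script path is the one used in the cron entry.

```shell
# Write a oneshot service and a 5-minute timer into the given directory.
# Usage: write_timer_units <directory>
write_timer_units() {
    local dir="$1"
    cat > "$dir/linux-health-check.service" <<'EOF'
[Unit]
Description=Linux health check

[Service]
Type=oneshot
ExecStart=/usr/local/bin/linux-health-check.sh
EOF
    cat > "$dir/linux-health-check.timer" <<'EOF'
[Unit]
Description=Run the Linux health check every 5 minutes

[Timer]
OnCalendar=*:0/5
Persistent=true

[Install]
WantedBy=timers.target
EOF
}
```

On a real host this would typically be followed by `write_timer_units /etc/systemd/system`, `systemctl daemon-reload`, and `systemctl enable --now linux-health-check.timer`. The script's output then lands in the journal, where `journalctl -u linux-health-check.service` can read it.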
Improving Reliability And Maintainability
A health check script is operational code. It should be maintained like one. Start by validating inputs and guarding against unexpected conditions. If a variable is empty when it should not be, fail safely and explain why. That is better than producing misleading output.
Use ShellCheck during development to catch common Bash problems such as unquoted variables, unsafe tests, and portability mistakes. It is one of the fastest ways to improve script quality before production use. Keep service names, thresholds, and notification targets in configuration variables so changes do not require editing logic.
Version control is another practical improvement. Even a small script benefits from change history, rollback capability, and review. When the script grows, break reusable pieces into separate utilities. For example, one helper can format output, another can handle alerting, and another can manage locks.
- Validate inputs before running checks
- Use configuration variables for thresholds and services
- Run ShellCheck on every meaningful edit
- Track changes in version control
- Split the script up if it becomes a toolkit
These are simple habits, but they pay off. They reduce false positives, lower maintenance time, and make the script easier to trust. For admins working under operational standards or audit controls, that reliability is part of the job.
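Input validation can be as small as two guard functions called at the top of the script. The names below are hypothetical, and the checks are a sketch of the idea, not a complete validation layer.

```shell
# Fail with a clear message when a threshold is not a whole number from 0 to 100.
# Usage: require_percent <name> <value>
require_percent() {
    local name="$1" value="$2"
    case "$value" in
        ''|*[!0-9]*)
            echo "ERROR: $name must be a whole number, got '$value'" >&2
            return 1 ;;
    esac
    if [ "$value" -gt 100 ]; then
        echo "ERROR: $name must be between 0 and 100, got $value" >&2
        return 1
    fi
}

# Fail early when a command the script depends on is missing.
# Usage: require_cmd <command>
require_cmd() {
    command -v "$1" >/dev/null 2>&1 || {
        echo "ERROR: required command '$1' not found" >&2
        return 1
    }
}
```

Calls such as `require_percent WARN_DISK "$WARN_DISK" || exit 3` and `require_cmd df || exit 3` make a misconfigured host fail loudly before any check runs. The exit code 3 is only an example.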
Testing The Script Before Production Use
Do not place a new health script directly into production and assume it is correct. Test it on a non-production host first. Verify the normal output, then deliberately cause failures so you can see whether the script reports them accurately. That is the fastest way to catch logic mistakes and threshold problems.
Good test cases include stopping a service, filling a test filesystem, or creating a temporary load condition. You can also simulate permission issues by running the script without root, or remove a command to see how the script behaves when a dependency is missing. A healthy script should fail clearly, not silently.
- Run the script on a test host
- Confirm normal reporting looks clean
- Simulate one failure at a time
- Verify alerts, logs, and exit codes
- Tune thresholds to reduce false positives
It is also worth checking behavior under real scheduling conditions. Make sure output does not overlap if the script runs longer than expected. Review whether the exit code is meaningful to cron, systemd, or whatever consumes the result. A non-zero exit should represent a real failure, not a formatting issue.
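One way to keep exit codes meaningful is to track the worst severity seen during the run and exit with it. The sketch below uses 0 for OK, 1 for WARN, and 2 for CRITICAL, which mirrors the convention used by Nagios-style plugins; the variable and function names are assumptions for this example.

```shell
# Track the worst severity seen and map it to an exit code:
# 0 = all OK, 1 = at least one WARN, 2 = at least one CRITICAL.
WORST=0

# Usage: record_severity <OK|WARN|CRITICAL>
record_severity() {
    case "$1" in
        CRITICAL) WORST=2 ;;
        WARN)     if [ "$WORST" -lt 1 ]; then WORST=1; fi ;;
    esac
    return 0
}
```

Each check calls record_severity with its result, and the script ends with `exit "$WORST"`. Cron mail wrappers, systemd, or a monitoring agent can then tell a warning from a critical failure without parsing the text.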
This is where operational learning overlaps with defensive skill building. In CEH v13 topics, you learn to think about what can fail, how it shows up, and how to verify it. That mindset makes your Bash monitoring script more useful and more trustworthy.
Conclusion
Bash scripts are still one of the best low-overhead ways to automate Linux system health monitoring. They are easy to deploy, easy to adapt, and powerful enough to watch the checks that matter most: CPU, memory, disk, services, logs, and network status. When those checks run on a schedule, you get earlier warning, better visibility, and faster response.
The real value comes from consistency. A script does the same thing every time, which means fewer missed issues and less dependence on memory or manual routines. Start with the most important checks for your server role, then expand the script as your environment grows. That is how practical automation becomes part of normal Linux operations, not a side project.
If you are building or improving this kind of workflow, ITU Online IT Training’s CEH v13 course is a useful place to strengthen the security mindset behind the scripting. The goal is not just to detect problems. The goal is to understand them early enough to act before users notice.
For further reference, use the official docs for Bash, systemctl, Linux vendor documentation, and operational standards from NIST and CISA. That combination gives you a solid foundation for reliable Linux monitoring.
CompTIA®, Cisco®, Microsoft®, AWS®, EC-Council®, ISC2®, ISACA®, and PMI® are trademarks of their respective owners.