Server maintenance prevents major problems and keeps things running well. Take time for this simple check on server hardware and software in your data center.
Data center servers are sophisticated machines, but they are still machines. Like other machines, they require regular maintenance to operate at peak performance. Simple maintenance procedures reduce serious service calls and extend the server’s working life.
Server Maintenance Checklist in the Data Center
Even with the performance and redundancy features of modern servers, heavier workload consolidation and higher reliability expectations put more strain on each system. The server maintenance checklist must cover physical elements as well as critical system configurations.
Server administrators overlook maintenance planning too often. Don’t wait until there is a real failure. Set aside time for routine server maintenance and follow a defined procedure.
The frequency of server maintenance depends on the age of the equipment, the data center environment, the number of servers that require maintenance and other factors.
For example, older equipment in a server rack requires more frequent checks than new servers installed in a HEPA-filtered data center. Organizations can base routine maintenance schedules on the routines of the vendor or a third-party provider. If the vendor service contract calls for a system inspection every four or six months, follow that schedule.
Preparation is everything
Have a plan before you tackle any item on the server maintenance checklist. This includes checking the system log for errors or events that require more direct attention. For example, if the system log shows an error with a particular memory module, order a replacement DIMM and have it on hand for installation. Likewise, if firmware, operating system or agent patches or updates are available, test and vet them before the maintenance window.
Have a clear plan to take the system offline and return it to service later. Before the advent of virtualization, servers and their resident applications needed downtime to accommodate maintenance windows. This often forced IT personnel to do maintenance at night or on weekends.
Virtual servers allow workload migration rather than downtime: you can migrate applications to other servers, where they remain available while maintenance takes place on the underlying host system. Before maintenance, determine where each VM must go, migrate the VMs to the selected systems and verify each workload before shutting down the server.
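The planning step above (decide where each VM goes before the host is taken down) can be sketched in Python. This is a minimal illustration under stated assumptions, not a hypervisor API: the function `plan_evacuation`, the host and VM names, and the memory figures are all hypothetical, and free memory is assumed to be the only placement constraint.

```python
from typing import Dict, Optional


def plan_evacuation(vms: Dict[str, int],
                    free_mem: Dict[str, int]) -> Optional[Dict[str, str]]:
    """Assign each VM (name -> memory in GB) to a destination host.

    free_mem maps candidate host -> free memory in GB. Returns a
    VM -> host mapping, or None if some VM has nowhere to go, in which
    case the host should not be taken down yet.
    """
    remaining = dict(free_mem)  # work on a copy so the input is untouched
    plan: Dict[str, str] = {}
    # Place the largest VMs first; they are the hardest to fit.
    for vm, need in sorted(vms.items(), key=lambda kv: kv[1], reverse=True):
        target = next((h for h, free in remaining.items() if free >= need), None)
        if target is None:
            return None
        plan[vm] = target
        remaining[target] -= need
    return plan


# Hypothetical inventory for the host about to be serviced.
vms_on_host = {"web01": 8, "db01": 32, "build01": 16}
candidates = {"hostB": 40, "hostC": 24}

plan = plan_evacuation(vms_on_host, candidates)
print(plan)
```

A result of `None` is the useful signal here: it means the maintenance window should be postponed or more capacity found. A real plan would also weigh CPU, storage and network constraints, and each workload still has to be verified after migration.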
At this point, you can usually turn off the server and remove it from the rack or other enclosure.
Make sure the server can breathe
Once the server is offline, visually check the external and internal airflow paths. Remove accumulated dust and other debris that can block cooling air.
Start with the exterior air inlets and outlets, then move inside the chassis and inspect the CPU heat sink and fan assembly, the memory modules, and all cooling fans and air ducts. Remove dust and debris with clean, dry compressed air. Don’t clean the server while it sits in the rack; pull it out first.
Cleaning dust is an old-fashioned chore, but that doesn’t mean it’s obsolete. Dust is a thermal insulator, so it must be removed. Alternative cooling schemes and ASHRAE recommendations have raised data center operating temperatures, which leaves less thermal margin. Dust and other airflow obstructions cause the server to use more energy and can even trigger component failures that are entirely avoidable.
Check local hard disk
Many servers rely on internal hard disks for booting and for storing workloads, user data and other content. Disk media problems can seriously degrade workload performance and stability, and often lead to hard disk failure.
Magnetic media is not perfect. Common problems include bad sectors and fragmentation. RAID goes a long way toward preserving data integrity after a storage error, but smaller 1U rack servers may not provide enough physical space for disk arrays. Use a utility such as CHKDSK (Check Disk) to verify disk integrity and attempt to recover bad sectors. The updated version of CHKDSK in Windows Server 2012 can quickly analyze and correct problems in the file system structure.
Disk fragmentation will not disappear on its own. File systems such as NTFS and FAT allocate disk space from the first available clusters, so a file’s clusters can end up scattered across the disk. Fragmentation can slow down the server’s disks and contribute to failure. A utility like Optimize-Volume under Windows Server 2012 can rearrange the clusters of each file so that they sit contiguously on disk.
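To see why first-available allocation leads to fragmentation, here is a toy Python model. It is only a sketch: a disk is a list of clusters, files are single-letter names, and the functions `allocate`, `delete`, `fragments` and `defragment` are hypothetical helpers. Real file systems are far more sophisticated than this.

```python
from typing import List, Optional

Disk = List[Optional[str]]  # one entry per cluster: owning file name or None


def allocate(disk: Disk, name: str, size: int) -> None:
    """Give `name` the first `size` free clusters, wherever they are."""
    free = [i for i, owner in enumerate(disk) if owner is None]
    if len(free) < size:
        raise ValueError("not enough free clusters")
    for i in free[:size]:
        disk[i] = name


def delete(disk: Disk, name: str) -> None:
    """Free every cluster owned by `name`."""
    for i, owner in enumerate(disk):
        if owner == name:
            disk[i] = None


def fragments(disk: Disk, name: str) -> int:
    """Count the contiguous runs of clusters owned by `name`."""
    runs, prev = 0, None
    for owner in disk:
        if owner == name and prev != name:
            runs += 1
        prev = owner
    return runs


def defragment(disk: Disk) -> Disk:
    """Return a new layout with each file's clusters stored contiguously."""
    order: List[str] = []
    for owner in disk:
        if owner is not None and owner not in order:
            order.append(owner)
    packed: Disk = [name for name in order for _ in range(disk.count(name))]
    return packed + [None] * (len(disk) - len(packed))


disk: Disk = [None] * 12
for name in ("A", "B", "C"):
    allocate(disk, name, 3)
delete(disk, "B")        # leaves a 3-cluster hole in the middle
allocate(disk, "D", 5)   # D fills the hole, then spills past C

print(disk, fragments(disk, "D"))
tidy = defragment(disk)
print(tidy, fragments(tidy, "D"))
```

In this model file D ends up in two separate runs after the delete-and-write sequence, and a single run after the defragmentation pass. That is the effect a defragmentation utility aims for on a real volume.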
Read the event log
The server records a lot of information in its event logs, especially details about problems. No server maintenance checklist is complete without a review of the system, malware and other event logs. Critical system problems will naturally get the attention of IT administrators and technicians, but many small issues can point to chronic and serious problems.
While you are there, check the reporting setup and verify that alerts and alarms go to the correct recipients. For example, if a technician leaves the server group, you must update the server reporting system. Double-check the contact method too. Critical errors reported to a technician’s e-mail address may be completely inadequate if they occur outside of working hours.
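A recipient check like this is easy to automate. The Python sketch below assumes a hypothetical data layout (a list of recipient records and a set of current staff names). It also assumes, purely for illustration, that SMS, pager and phone count as after-hours contact methods while e-mail does not.

```python
from typing import Dict, List, Set

# Assumption: these methods reach someone outside working hours.
AFTER_HOURS_METHODS = {"sms", "pager", "phone"}


def audit_alert_recipients(recipients: List[Dict],
                           current_staff: Set[str]) -> List[str]:
    """Return human-readable findings about stale or inadequate recipients."""
    findings = []
    for r in recipients:
        name = r["name"]
        if name not in current_staff:
            findings.append(f"{name}: no longer in the server group")
            continue
        reachable = set(r["methods"]) & AFTER_HOURS_METHODS
        if r["severity"] == "critical" and not reachable:
            findings.append(f"{name}: critical alerts have no after-hours contact method")
    return findings


# Hypothetical reporting configuration.
recipients = [
    {"name": "alice", "severity": "critical", "methods": ["email", "sms"]},
    {"name": "bob", "severity": "critical", "methods": ["email"]},
    {"name": "carol", "severity": "warning", "methods": ["email"]},
]
staff = {"alice", "bob"}

findings = audit_alert_recipients(recipients, staff)
for line in findings:
    print(line)
```

With this sample data the audit flags bob (critical alerts go to e-mail only) and carol (no longer on the team), which are the two situations described above.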
Be proactive with log data. When a log inspection reveals a chronic or recurring problem, a proactive investigation can resolve it before it escalates. For example, if the server logs report recoverable errors in a memory module, those errors will not trigger a critical alarm. But repeated occurrences point to a problem with the module, and IT staff can run more detailed diagnostics to identify an impending failure.
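As a rough illustration of spotting recurring recoverable errors, the Python sketch below counts correctable memory errors per module and flags any module that reaches a threshold. The log line format, the `DIMM_xx` naming and the threshold of three are assumptions made up for this example; real logs differ by vendor and operating system.

```python
import re
from collections import Counter
from typing import Dict

# Hypothetical log format. \b keeps "uncorrectable" from matching.
PATTERN = re.compile(r"\bcorrectable ECC error on (DIMM_\w+)")

SAMPLE_LOG = """\
2024-03-01T02:14:07 WARN correctable ECC error on DIMM_A2
2024-03-03T11:40:52 WARN correctable ECC error on DIMM_A2
2024-03-04T08:01:19 INFO fan speed adjusted
2024-03-07T23:55:30 WARN correctable ECC error on DIMM_B1
2024-03-09T04:12:44 WARN correctable ECC error on DIMM_A2
"""


def recurring_memory_errors(log_text: str, threshold: int = 3) -> Dict[str, int]:
    """Return modules whose correctable-error count reaches the threshold."""
    counts = Counter(m.group(1) for m in PATTERN.finditer(log_text))
    return {module: n for module, n in counts.items() if n >= threshold}


suspects = recurring_memory_errors(SAMPLE_LOG)
print(suspects)
```

None of these entries would raise a critical alarm on its own, but the count singles out one module as worth a closer diagnostic look and possibly a replacement order.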
If the problem is not severe enough to warrant keeping the server shut down, the server can return to operation until the replacement hardware arrives.
Take time to patch and update
The server software stack (BIOS, OS, hypervisor, drivers, applications and supporting tools) must all interact and work together. Unfortunately, software code is rarely problem-free, so pieces of the puzzle are often patched or updated to fix bugs, improve security, ease interoperability and improve performance.
Production software should not be updated automatically. Administrators must determine whether a patch or upgrade is needed, then evaluate and test the change thoroughly. If an update only fixes a problem you do not have, don’t add it to the process.
Software developers cannot test every potential combination of hardware and software, so patches and updates can cause more problems than they fix on your specific server or software stack. For example, a monitoring agent patch can cause performance problems for important workloads because the new agent needs more bandwidth than expected.
The shift to DevOps, with smaller and more frequent updates, exacerbates potential problems. You still need to test each patch or update in the lab before rolling it out. And always make sure you can roll back changes and restore the original software configuration if necessary.
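The "make sure you can roll back" rule can be pictured as snapshot, apply, check, restore. The Python sketch below models a configuration as a plain dictionary. The function `apply_with_rollback`, the configuration keys and the health check are hypothetical stand-ins for whatever snapshot or restore mechanism your platform actually provides.

```python
import copy
from typing import Callable, Dict


def apply_with_rollback(config: Dict,
                        patch: Callable[[Dict], None],
                        healthy: Callable[[Dict], bool]) -> bool:
    """Apply `patch` in place; restore the snapshot if it fails the check."""
    snapshot = copy.deepcopy(config)  # taken BEFORE anything changes
    try:
        patch(config)
        ok = healthy(config)
    except Exception:
        ok = False  # a patch that crashes is treated like a failed check
    if not ok:
        config.clear()
        config.update(snapshot)
    return ok


# Hypothetical monitoring-agent settings.
config = {"agent_version": "1.4", "poll_interval_s": 60}


def aggressive_agent_patch(cfg: Dict) -> None:
    cfg["agent_version"] = "2.0"
    cfg["poll_interval_s"] = 1  # new agent polls far more often


def bandwidth_ok(cfg: Dict) -> bool:
    # Assumed lab criterion: polling faster than every 30 s is too costly.
    return cfg["poll_interval_s"] >= 30


applied = apply_with_rollback(config, aggressive_agent_patch, bandwidth_ok)
print(applied, config)
```

The important design point is that the snapshot is captured before the change and the restore path is exercised in the lab, not discovered during a production incident.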
Verify and record any changes
Much can change on a server during maintenance: hardware, software and system configuration. Once you have completed the server maintenance checklist, it is important for IT staff to verify and record the new state of the system. For example, replacing a network adapter, adding or replacing DIMMs, updating the OS and many other actions can change the system configuration.
Organizations that rely on system configuration management tools may need to update or “discover” the changes and record them in the configuration management database before the system is allowed to re-enter service.
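One simple way to capture what changed is to compare a configuration snapshot taken before maintenance with one taken afterward. The Python sketch below is illustrative only: the snapshot keys and values are invented, and `diff_config` is a hypothetical helper rather than part of any configuration management product.

```python
from typing import Any, Dict, Tuple


def diff_config(before: Dict[str, Any],
                after: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Map each changed key to (old, new); None marks an absent key."""
    changes = {}
    for key in sorted(set(before) | set(after)):
        old, new = before.get(key), after.get(key)
        if old != new:
            changes[key] = (old, new)
    return changes


# Hypothetical snapshots taken before and after the maintenance window.
before = {"nic": "1GbE adapter", "memory_gb": 64, "os_build": "10.1"}
after = {"nic": "10GbE adapter", "memory_gb": 128, "os_build": "10.1",
         "ids_enabled": True}

changes = diff_config(before, after)
for key, (old, new) in changes.items():
    print(f"{key}: {old!r} -> {new!r}")
```

The resulting list of differences is the kind of record that would be entered in the configuration management database before the server goes back into service.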
Also verify the system’s security posture, such as firewall settings, anti-malware versions or scan settings, and intrusion detection and prevention (IDS/IPS).
Security checks help ensure that changes to the system software do not reopen an attack vector that was closed in the previous configuration.
And finally, don’t forget to update system backup or disaster recovery (DR) content after the server returns online. Verify that the server’s backup posture and frequency remain unchanged, unless specific settings need to be adjusted to reflect a changed server role.