4 August 2018

Tips for managing server infrastructure

Automation is very important

Well-established monitoring is one of the most important elements of managing infrastructure successfully. Manually monitoring a large number of servers is physically impossible even with a large staff of engineers, so automation is necessary. The system itself must detect possible problems and notify the engineers. For such monitoring to be most effective, all possible points of failure must be anticipated in advance. Problems should be detected automatically at an early stage, so that engineers can deal with them as part of routine work rather than restore services that have already gone down.
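As a minimal sketch of early, automatic detection: the check below raises a warning before a metric reaches its critical level, so engineers can react during routine work. The metric names, the limits, and the server name are hypothetical, chosen only for illustration; a real setup would take them from the monitoring system in use.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    warning: float   # notify engineers, nothing is down yet
    critical: float  # service is at or near failure


# Hypothetical metrics and limits, for illustration only.
THRESHOLDS = {
    "disk_used_percent": Threshold(warning=80.0, critical=95.0),
    "memory_used_percent": Threshold(warning=85.0, critical=97.0),
}


def evaluate(server: str, metrics: dict[str, float]) -> list[str]:
    """Return one alert line per metric that crossed a threshold."""
    alerts = []
    for name, value in metrics.items():
        limit = THRESHOLDS.get(name)
        if limit is None:
            continue  # no rule defined for this metric
        if value >= limit.critical:
            alerts.append(f"CRITICAL {server} {name}={value}")
        elif value >= limit.warning:
            alerts.append(f"WARNING {server} {name}={value}")
    return alerts


print(evaluate("web-01", {"disk_used_percent": 83.0, "memory_used_percent": 40.0}))
# prints ['WARNING web-01 disk_used_percent=83.0']
```

The warning level is the point of the exercise: it is the stage at which the problem can still be handled calmly.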

Redundancy can save you in critical situations

In the case of large infrastructure, redundancy becomes even more important, since a large number of users will be affected in the event of a failure.

Worth making redundant:

Monitoring servers - it is necessary not only to build a system that monitors for failures, but also to use tools that monitor the monitoring system itself.

Management tools

Communication channels - if your data center has only one provider, a failure on its side will cut all your hardware off from the outside world.
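Watching the monitoring system itself is often done with a heartbeat check: the monitor records a timestamp at regular intervals, and an independent script, ideally on a different host and network channel, raises the alarm when that timestamp becomes too old. The sketch below assumes a two-minute limit; the limit and the function name are hypothetical.

```python
import time
from typing import Optional

# Assumed limit: the monitor is expected to report at least every 2 minutes.
MAX_SILENCE_SECONDS = 120


def monitor_is_alive(last_heartbeat: float, now: Optional[float] = None) -> bool:
    """Return True if the monitoring system reported recently enough.

    Both arguments are Unix timestamps in seconds; `now` defaults to the
    current time and exists mainly to make the check easy to test.
    """
    if now is None:
        now = time.time()
    return (now - last_heartbeat) <= MAX_SILENCE_SECONDS


print(monitor_is_alive(last_heartbeat=1000.0, now=1060.0))  # prints True
print(monitor_is_alive(last_heartbeat=1000.0, now=1200.0))  # prints False
```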

Writing documentation and keeping logs

When a project is supported by several engineers, it is very important to document the workflow as the infrastructure is upgraded. From the very beginning, you should create your own knowledge base and, whenever new features are introduced, prepare documentation for working with them, even if everyone who supports the project already knows how everything works. As the team of engineers grows, well-written documentation greatly speeds up the process of familiarizing a new team member with the project.

It is important to analyze the reaction time of data center engineers

All reputable modern data centers provide a remote hands service: in the event of problems, the server owner can ask the engineers on site to perform the necessary actions with the equipment. But the on-site staff do not always provide this service well. There can be many reasons; the most frequent are a high workload on the specialists and the absence of some engineers from the site. In critical situations, however, the reaction time is very important, and a problem can arise outside working hours as well.
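One way to analyze reaction time is to keep a log of remote hands requests and compare the typical wait during working hours with the wait at night and on weekends. The sketch below does this with a small, entirely hypothetical ticket log; the working schedule (Monday to Friday, 09:00 to 18:00) is also an assumption.

```python
from datetime import datetime
from statistics import median

# Hypothetical ticket log: (request opened, engineer started work).
TICKETS = [
    (datetime(2018, 7, 2, 10, 0), datetime(2018, 7, 2, 10, 12)),   # Monday
    (datetime(2018, 7, 3, 15, 30), datetime(2018, 7, 3, 15, 50)),  # Tuesday
    (datetime(2018, 7, 7, 2, 15), datetime(2018, 7, 7, 3, 40)),    # Saturday night
    (datetime(2018, 7, 8, 23, 5), datetime(2018, 7, 8, 23, 50)),   # Sunday night
]


def is_working_hours(moment: datetime) -> bool:
    """Assumed schedule: Monday to Friday, 09:00 to 18:00."""
    return moment.weekday() < 5 and 9 <= moment.hour < 18


def median_reaction_minutes(tickets, working_hours: bool) -> float:
    """Median minutes between request and reaction for one time bucket.

    Raises statistics.StatisticsError if the bucket has no tickets.
    """
    waits = [
        (started - opened).total_seconds() / 60
        for opened, started in tickets
        if is_working_hours(opened) == working_hours
    ]
    return median(waits)


print(median_reaction_minutes(TICKETS, working_hours=True))   # prints 16.0
print(median_reaction_minutes(TICKETS, working_hours=False))  # prints 65.0
```

A large gap between the two numbers suggests that the data center is understaffed exactly when an unattended failure is most likely to go unnoticed.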

Do not chase savings, and experiment with care

It is necessary to study the characteristics of equipment carefully; this makes it possible to predict problems more accurately. For example, in the case of SSDs, a little extra time spent on analysis can yield equipment that lasts much longer than drives bought in a hurry.
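For SSDs, one characteristic worth checking is the endurance rating, usually given in the datasheet as terabytes written (TBW). A rough lifetime estimate divides that rating by the expected daily write volume. The sketch below is a simplification: it ignores write amplification and other failure modes, and the figures in the usage lines are hypothetical.

```python
def ssd_lifetime_years(tbw_rating_tb: float, daily_writes_gb: float) -> float:
    """Estimate the years until the rated terabytes-written limit is reached."""
    if daily_writes_gb <= 0:
        raise ValueError("daily_writes_gb must be positive")
    days = tbw_rating_tb * 1000 / daily_writes_gb  # 1 TB = 1000 GB
    return days / 365


# Hypothetical drives under the same load of 200 GB written per day.
print(round(ssd_lifetime_years(150, 200), 1))  # prints 2.1
print(round(ssd_lifetime_years(600, 200), 1))  # prints 8.2
```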

You should also be prepared for the fact that this approach will not save money here and now. In the long run, saving on hardware turns into losses: a lower price usually comes with lower reliability, and repairing and replacing hardware eventually costs more than buying more expensive equipment that lasts longer.
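The arithmetic behind this can be sketched as a simple total cost comparison over a planning horizon. All prices, lifetimes, and labor costs below are hypothetical, and the model deliberately leaves out downtime losses, which would make the cheaper option look even worse.

```python
import math


def total_cost(price: float, lifetime_years: float, horizon_years: float,
               replacement_labor: float) -> float:
    """Cost of keeping one unit in service over the planning horizon."""
    units_needed = math.ceil(horizon_years / lifetime_years)
    replacements = units_needed - 1  # the first unit is an install, not a swap
    return units_needed * price + replacements * replacement_labor


# Hypothetical comparison over six years.
cheap = total_cost(price=100, lifetime_years=2, horizon_years=6, replacement_labor=50)
durable = total_cost(price=250, lifetime_years=6, horizon_years=6, replacement_labor=50)
print(cheap, durable)  # prints 400 250
```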