Basics of Good Server Design

Updated: Jan 29

Your key challenge when designing servers for any corporation is that TCO (total cost of ownership) tends to be a CTO-level (or similar) responsibility, while your buyer's focus will tend to be on CAPEX.



Getting your customer into a solution that performs properly and gives them the "most bang for their buck" hinges on proper design. Any bottleneck slows everything down. You can buy the highest-end processors, but if you don't have enough memory, performance will still bottleneck.


How substantial the server build-out needs to be always comes down to how busy the server is going to be:

  • What is it doing? (how much compute is involved, e.g., simple file transfer vs. a database)

  • How many people are using it?

  • If the server will host VMs, multiply the design requirements by the number of VMs (see the sketch below).
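
As a rough illustration of that last point, here is a minimal sizing sketch in Python; the per-VM figures, VM count, and headroom are hypothetical placeholders, not recommendations:

  # Rough sizing sketch: scale a per-VM resource profile by the number of VMs.
  # All figures below are hypothetical placeholders.
  per_vm = {"vcpus": 4, "memory_gb": 16, "storage_gb": 200, "iops": 500}
  vm_count = 10
  headroom = 1.25  # leave ~25% room for growth (see the note below)

  totals = {name: value * vm_count * headroom for name, value in per_vm.items()}
  print(totals)  # e.g. 50 vCPUs, 200 GB memory, 2500 GB storage, 6250 IOPS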

Note: Any good design always leaves room for growth. It is safer to design a server with too much performance than to hamstring it with too little.


Processors:


Intel has Low, Standard, and High Performance families, with specific orientations within each group. For example: for databases, cache is king; for older single-threaded apps, frequency is more important than core count.


Note: Moves within a family gain a few percentage points of performance, but moves between groups provide much larger performance jumps.


Populating memory:


  • The largest DIMMs carry price premiums

  • Fill all memory channels on any bank you populate for up to a 30% performance gain

  • Denser is cheaper (and leaves slots open for future expansion) – see the sketch below
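
To make that trade-off concrete, here is a minimal sketch comparing two ways to reach the same capacity on a hypothetical 8-channel socket; the DIMM prices are made-up placeholders, not real quotes:

  # Hypothetical comparison: reach 256 GB on one socket with 8 memory channels.
  # Prices are placeholder numbers for illustration only.
  channels = 8
  configs = [
      {"dimm_gb": 32, "count": 8, "price_each": 150},  # one DIMM per channel (max bandwidth)
      {"dimm_gb": 64, "count": 4, "price_each": 280},  # denser DIMMs, slots left free
  ]

  for c in configs:
      capacity = c["dimm_gb"] * c["count"]
      cost = c["count"] * c["price_each"]
      print(f"{c['count']} x {c['dimm_gb']} GB = {capacity} GB, "
            f"${cost}, channels filled: {c['count']}/{channels}")

Filling every channel maximizes memory bandwidth, while the denser configuration costs a bit less in this made-up pricing and leaves four slots open for future expansion.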


Hard drives:


Spindle count + spindle speed = performance. Hard drives used to be the bottleneck; that is less true with SSDs, but keep in mind that an SSD's main advantage is reads (up to 30 times faster than spinning disks). Also, WI (write-intensive) SSDs are several times faster at writes than RI (read-intensive) SSDs, because they write once, rather than multiple times, to each memory cell.

With table-oriented databases, it is almost always a good idea to add a bootable RAID 1 so your customer can separate the boot and log files from the database tables, which can give them up to 30% better database performance.


80/20 rule – Only fill hard drives to 80% of capacity. Calculate by starting with 120% of the space requirement, then adding parity drives to get the total drives required.


If you need 20 TB, figure 24 TB total (120% of 20 TB), then add parity drive(s):

  • RAID 5 – allow 1 drive for parity

  • RAID 6 – allow 2 drives for parity

  • RAID 1 or 10 – allow half the drives for the mirror copies

Also, try to cover data growth – 50% every two years is a typical average. (A rough drive-count sketch follows below.)
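
Here is a minimal sketch of that drive-count math in Python; the 20 TB requirement, drive size, and growth assumption are placeholders, and the RAID overheads follow the list above:

  import math

  # Drive-count sketch following the 80/20 rule above.
  required_tb = 20
  growth_factor = 1.5     # ~50% data growth over two years
  drive_tb = 4            # usable capacity per drive (placeholder)

  usable_tb = required_tb * growth_factor * 1.2   # 20% headroom keeps drives ~80% full
  data_drives = math.ceil(usable_tb / drive_tb)

  print(f"RAID 5:  {data_drives + 1} drives")    # 1 parity drive
  print(f"RAID 6:  {data_drives + 2} drives")    # 2 parity drives
  print(f"RAID 10: {data_drives * 2} drives")    # mirrors double the count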


Note: When counting on parity to keep your server up and running, keep in mind that large NL-SAS drives take a very long time to rebuild.


One of the first things I do, even on a small server, is add a hardware-based RAID card.

  • Keeps hard drive failures from taking a server down

  • Keeps data from corrupting (and data corruption is a definite, although mostly hidden, factor in server downtime)

  • Roughly doubles throughput by spreading I/O across multiple drives

Note: Some operating systems handle all the RAID work and just want you to “get that hardware RAID card out of the way”. HBAs are your proper answer there.


Always make mechanical parts redundant, because moving parts cause roughly 90% of physical server breakdowns:

  • 60% are from hard drives (if spinning disks)

  • 20% are from power supplies

  • 10% are from fans

  • 10% are from everything else

Note: VMs multiply risk, so any server running VMs should always be completely redundant.


Miscellaneous, but very important for more highly engineered servers, like the ones Dell manufactures:

  • Dell servers have three brains that must stay synced to keep them running smoothly: BIOS, iDRAC, and the Lifecycle Controller.

  • Four out of five support issues (in large environments) can be traced back to firmware updates not being properly applied.

  • Dell server updates can be automated with OME, OpenManage Essentials. It is one of Dell’s hidden gems (and it is free!).

  • For Dell servers, basic support funnels you into a separate 9-to-5 queue manned by script readers. Time lost in basic troubleshooting can degrade the customer experience and increase TCO, and in any kind of mission-critical environment it creates unnecessary risk of downtime.

Note: In large companies, coordinating and standardizing or templating purchases (vs. letting each department handle its own needs) is just one of the difference makers that can produce significant savings in the long run.