Sunday, April 29, 2007

SCALABILITY

For years, developers, database administrators, and system admins have debated big-box versus multi-box solutions: the advantages and disadvantages of "scaling up" by adding more memory, CPUs, and so on to one box, versus "scaling out" by adding more, less expensive boxes. The question is: how is scalability achieved?

Application scalability can be defined as the ability to increase application throughput in proportion to the hardware hosting the application, and the ability of an application to continue to meet its performance objectives under increased load. In other words, if an application can handle 100 users on single-CPU hardware, then it should be able to handle 200 users when the number of processors is doubled.
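That linear ideal can be expressed as a simple ratio. Here is a small Python sketch (the function name and numbers are mine, for illustration only):

```python
def scaling_efficiency(base_users, base_cpus, new_users, new_cpus):
    """Ratio of throughput growth to hardware growth; 1.0 is perfectly linear."""
    return (new_users / base_users) / (new_cpus / base_cpus)

# The example above: 100 users on 1 CPU, 200 users on 2 CPUs.
print(scaling_efficiency(100, 1, 200, 2))  # 1.0 -> linear scalability
```

In practice the ratio falls below 1.0 as contention and coordination overhead grow, which is exactly the flattening curve discussed later.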
Scale Up vs. Scale Out

There are two main approaches to scaling:

· Scaling up. With this approach, you upgrade your existing hardware. You might replace existing hardware components, such as a CPU, with faster ones, or you might add new hardware components, such as additional memory. The key hardware components that affect performance and scalability are CPU, memory, disk, and network adapters. An upgrade could also entail replacing existing servers with new servers.

· Scaling out. With this approach, you add more servers to your system to spread application processing load across multiple computers. Doing so increases the overall processing capacity of the system.


Scaling up refers to moving an application to a larger class of hardware that uses more powerful processors, more memory, and quicker disk drives. Scaling out refers to an implementation of federated servers, where consumer-class computers are added and where data is then partitioned or replicated across them.

Scaling out can be done by using functional partitioning. For example, you might scale out by putting your Customer Relationship Management (CRM) functionality on one server and your Enterprise Resource Planning (ERP) functionality on another server. Or, you could scale out by using data partitioning. For example, you might scale out by creating updatable partitioned views across databases.
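As an illustration of data partitioning at the application layer, each key can be routed to a shard by a stable hash. This Python sketch is hypothetical; the shard names are placeholders, not a real topology:

```python
import hashlib

# Hypothetical shard map: customer rows spread across three database servers.
SHARDS = ["db-server-0", "db-server-1", "db-server-2"]

def shard_for(customer_id: str) -> str:
    """Route a customer key to one shard using a stable hash."""
    digest = hashlib.md5(customer_id.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Every lookup for the same customer lands on the same server.
assert shard_for("cust-42") == shard_for("cust-42")
```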

Scaling out (or Horizontal Scaling) means distributing the computing and data workload among multiple commodity servers by load balancing, with the ability to add or subtract servers to increase or decrease capacity. By distributing the workload, processing resources are spread among multiple low-cost servers, which improves both performance and the availability of the overall service at a dramatically lower cost.

Scaling up (or Vertical Scaling) refers to running an application on a single large SMP server and having the ability to add hardware processors and memory to increase overall system performance and scalability. Scale-up implies fewer, more expensive servers than with scale-out. The big issue here is that because of the ‘forklift’ upgrade approach, you have too much high-cost hardware which is often under-utilized.

Deploying a server farm of industry-standard servers is a better alternative to traditional high-cost SMP solutions. As a result of the declining cost of powerful commodity processors, and of open source software, IT managers have found that the scale-out model more cost-effectively delivers the performance, availability, and manageability previously found only in proprietary SMP technology.

The following scenarios must be considered when addressing two common scalability bottlenecks:

Processor and memory-related bottlenecks. Scaling up is usually a good approach if your bottlenecks are processor related or memory related. By upgrading to a faster processor or by adding more processors, you maximize use of your existing hardware resources. You can resolve memory bottlenecks by adding additional memory or by upgrading existing memory.

Disk I/O–related bottlenecks. Scaling up can also help to resolve disk I/O–related bottlenecks. This form of bottleneck usually occurs in online transaction processing (OLTP) applications, where an application performs random disk reads and writes, in contrast to the sequential access typical of online analytical processing (OLAP) applications. For OLTP applications, the I/O load can be spread by adding disk drives. Adding memory also helps: a larger SQL Server buffer cache reduces page faults, which in turn reduces the I/O load.


The following guidelines should be considered before deciding to scale up or scale out:

· Optimize the application before scaling up or scaling out.

· Address historical and reporting data.

· Scale up for most applications.

· Scale out when scaling up does not suffice or is cost-prohibitive.

Benefits of Scale-Out

· Cost-effectively add capacity to accommodate growth

· Reduce costs using commodity hardware and software

· Improve scalability by distributing load across servers

· Improve performance using multiple storage engines

· Improve availability using high-quality software


Scale Out Advantage
The scale-up model is not a cost-effective solution to address performance and scalability issues associated with database growth. Scaling up requires expensive and sophisticated hardware and operating systems to deliver scalability and availability to business applications.

· Scale up requires a huge up-front investment. Plus once a server has been fully configured with CPUs and memory, the next step is an expensive "fork-lift upgrade" to add capacity.

· Scale up does not provide linear or near linear scalability. Performance flattens out and further scaling up requires more high-cost hardware upgrades to get very modest performance improvements.
Scale-out enables organizations to cost-effectively solve database capacity issues that result from increased traffic and transaction volumes.

From an application perspective, scaling out provides other advantages.

Administration of conflicting needs Often, independent processes require different versions of the same software or, worse, different versions of a shared library (*.so). Conflicting needs occur when independent processes are required to run on the same box, in the same user and process space. Because of these conflicts, a multi-box solution is easier to administer. (Many people argue this is simply solved with multiple virtual machines (VMs); however, VMs present another set of risks and costs.)

Root cause analysis Problems will occur when software that is defective or troublesome is delivered. If all the software runs on a single box, in a single process space, the cause can be difficult to discover. Scaling out separate processes to different boxes makes it easier to determine where the problem lies.

Defective software isolation Defective, troublesome, or buggy software impacts everybody. However, the impact can be minimized if the defective software is isolated to a single box rather than impacting the entire application.

Failover The common approach is to provide numerous boxes with the same application and configuration. If a single box fails, the traffic is routed to another box. The user session is maintained by constant serialization of user session data between boxes. Constant serialization of data among boxes raises a myriad of issues: transient versus non-transient data, session logging propagation, limitations on certain design patterns, and the need for a unique awareness of global data. Defective software is the largest problem with the multi-box failover model. If a flaw in a specific user flow is the cause of a production failure, moving the user to another box will only cause the second box to fail. Since most failures are due to defective software, spreading the software in a multi-box failover model does not fix the problem.
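The session-serialization part of that model can be sketched in a few lines. This is a minimal Python illustration, assuming an in-memory dictionary stands in for the shared store (in production it would be a database or distributed cache reachable from every box):

```python
import pickle

# Hypothetical shared session store, visible to every box in the farm.
shared_store = {}

def save_session(session_id, state):
    # Serialize session state so another box can pick it up after failover.
    shared_store[session_id] = pickle.dumps(state)

def load_session(session_id):
    blob = shared_store.get(session_id)
    return pickle.loads(blob) if blob is not None else None

save_session("sess-1", {"user": "alice", "cart": ["book"]})
# A second box, after the first fails, restores the same state.
assert load_session("sess-1") == {"user": "alice", "cart": ["book"]}
```

Note that this sketch only shows the mechanics; it does nothing about the transient-data and global-state issues raised above, which is precisely the point of the criticism.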

Advantages

Load balancing Simple round-robin distribution of requests between boxes does not solve the problem of load balancing. Measuring the load of specific processes, and designing properly for those processes, is the best way of handling application load. Scaling out provides greater opportunities for tuning the operating system and processes.
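The difference between blind round robin and a load-aware choice can be shown in a short Python sketch (server names and counters are illustrative):

```python
import itertools

servers = ["box-a", "box-b", "box-c"]

# Naive round robin: hands out boxes in order, ignoring how busy each one is.
round_robin = itertools.cycle(servers)

# Load-aware choice: route each request to the box with the fewest active requests.
active = {s: 0 for s in servers}

def pick_least_loaded():
    choice = min(servers, key=lambda s: active[s])
    active[choice] += 1
    return choice

# Simulate: box-a is stuck with long-running requests.
active["box-a"] = 10
assert pick_least_loaded() in ("box-b", "box-c")
```

Round robin would still send every third request to the overloaded box-a; the load-aware picker routes around it.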

Right sizing It is easier to right-size a multi-box architecture. Sliding another blade server into a rack costs less than adding hardware to a big box.

Security In most large IT organizations, developers are not allowed access to the production servers. This security constraint hampers the ability to evaluate problems in production. Relying on logging to determine behaviour leads to its own set of issues, including increased load on the server and an extensive amount of coding.

· Process Expansion In a pipeline architecture, a process can be arbitrarily 'scaled' by substituting any number of identical sub-processes. This takes advantage of the rules for queues: many processes may feed from a queue, and a queue may be fed by many processes.
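Those queue rules are easy to demonstrate. This Python sketch uses threads as stand-ins for the identical sub-processes; capacity scales just by changing the worker count:

```python
import queue
import threading

NUM_WORKERS = 3          # "any number of identical sub-processes"
work = queue.Queue()
results = queue.Queue()

def worker():
    while True:
        item = work.get()
        if item is None:          # sentinel: shut this worker down
            break
        results.put(item * 2)     # stand-in for the real processing step
        work.task_done()

threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()
for n in range(10):
    work.put(n)                   # one queue fed by the producer...
work.join()                       # ...drained by many workers
for _ in threads:
    work.put(None)
for t in threads:
    t.join()

assert sorted(results.queue) == [n * 2 for n in range(10)]
```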

· Easily and cost-effectively add capacity to your database infrastructure.

· Reduced hardware costs – Adding several smaller systems is typically far less expensive than upgrading a mainframe-class system.

· Improved response time and availability – Scale-out improves the performance and availability of your system. Users experience fewer interruptions in accessing data.

· Increased flexibility – Right-size the initial purchase of commodity hardware and software, and retain the flexibility to incrementally add capacity as needed.

· Reduced risk of performance degradation.

· Improved scalability – Use replication to distribute large workloads to individual server nodes for execution.

Pros and Cons

Scaling up is a simple option and one that can be cost effective. It does not introduce additional maintenance and support costs. However, any single points of failure remain, which is a risk. Beyond a certain threshold, adding more hardware to the existing servers may not produce the desired results.

For an application to scale up effectively, the underlying framework, runtime, and computer architecture must also scale up.

Scaling out enables you to add more servers in the anticipation of further growth, and provides the flexibility to take a server participating in the Web farm offline for upgrades with relatively little impact on the cluster. In general, the ability of an application to scale out depends more on its architecture than on underlying infrastructure.
When to Scale Up vs. Scale Out?

Should you upgrade existing hardware or consider adding additional servers? To help you determine the correct approach, consider the following:

· Scaling up is best suited to improving the performance of tasks that are capable of parallel execution. Scaling out works best for handling an increase in workload or demand.

· For server applications to handle increases in demand, it is best to scale out, provided that the application design and infrastructure supports it.

· If an application contains tasks that can be performed simultaneously and independently of one another and the application runs on a single-processor server, you should execute the tasks asynchronously. Asynchronous processing is more beneficial for I/O-bound tasks and less beneficial when the tasks are CPU bound and restricted to a single processor. Multithreaded CPU-bound tasks on a single CPU perform relatively slowly because of the overhead of thread switching. In this case, you can improve performance by adding a CPU, to enable true parallel execution of the tasks.
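The I/O-bound case is easy to demonstrate. In this Python sketch, four independent 0.1-second waits (stand-ins for network or disk calls) run concurrently instead of back to back:

```python
import concurrent.futures
import time

def io_bound_task(n):
    time.sleep(0.1)   # stands in for a network or disk wait
    return n * n

start = time.perf_counter()
# Run the independent tasks concurrently instead of one after another.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(io_bound_task, range(4)))
elapsed = time.perf_counter() - start

assert results == [0, 1, 4, 9]
assert elapsed < 0.4   # roughly 0.1s concurrent vs about 0.4s sequential
```

If `io_bound_task` were pure number crunching instead of a wait, the threads would contend for the single CPU and the speedup would largely disappear, which is exactly the CPU-bound caveat above.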

How to scale

Reducing Bottlenecks
Behind every page load lurks potential processing bottlenecks. While you're thinking through application workflows, you have an opportunity to get the application architecture right, to avoid performance penalties and to simplify component distribution changes and maintenance.

Here is a list of some application processing scenarios and some high-level approaches to application architecture that can increase performance and reliability:

· Heavy Database Load. A qualified DBA is very much needed to tailor performance-tuning activities to the needs of the application (and believe me, it is not easy to find a fabulous DBA!). However, it is also possible to mitigate performance concerns by deploying the database engine on a separate tier, with the right hardware configuration, including mirrored drives.

· Long Running Operations. Operations such as database queries or inserts involving large result sets, heavy number crunching, and remote invocations can cause messages to queue up, delaying responses. These activities should be considered for asynchronous messaging. Memory is volatile and servers can fail; these are harsh realities. To mitigate the risk of losing request data during a round trip, to ensure reliable processing of that data, and to offload the work from the ASP.NET worker process, you can employ Microsoft Message Queuing (MSMQ) easily from the .NET Framework with System.EnterpriseServices.
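The store-before-acknowledge idea behind durable queuing can be sketched outside of MSMQ as well. This is a minimal Python illustration, not the MSMQ API; the file-per-message layout is my own simplification:

```python
import json
import os
import tempfile

# Persist each request to disk before acknowledging it, so a crash
# between receipt and processing does not lose the message.
QUEUE_DIR = tempfile.mkdtemp()

def enqueue(msg_id, payload):
    path = os.path.join(QUEUE_DIR, f"{msg_id}.json")
    with open(path, "w") as f:
        json.dump(payload, f)
        f.flush()
        os.fsync(f.fileno())      # force the message to stable storage

def dequeue_all():
    for name in sorted(os.listdir(QUEUE_DIR)):
        path = os.path.join(QUEUE_DIR, name)
        with open(path) as f:
            yield json.load(f)
        os.remove(path)           # delete only after successful processing

enqueue("0001", {"order": 42})
assert list(dequeue_all()) == [{"order": 42}]
```

Because the message survives on disk until it is processed and removed, a restart simply picks up where the crashed run left off.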

· Resource Intensive Features. Sometimes we have to hit the file system; for example, generating reports or PDF documents may ultimately require persisting file output. Number crunching can also be resource intensive, consuming large amounts of memory and CPU cycles. Both are examples of resource-intensive features that may need to be offloaded to another physical tier. By employing MSMQ and COM+ once again, with the help of components available in the System.EnterpriseServices namespace, you can offload work to other tiers in a reliable architecture.

· Server Down Conditions. Yes, it happens: servers go down, and MSMQ can help you recover in several ways. First, messages can be recorded (serialized) so that if a server goes down, upon restart those messages are ready and waiting to be replayed. Second, if a queue is trying to invoke a component on another tier that is currently unavailable, or an exception occurs, messages are passed through a series of retry queues before finally resting in a final dead-letter queue. Of course, there are a number of ways to configure this, but the thrust of it is that no message is lost.
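The retry-then-dead-letter flow can be sketched in a few lines of Python. The retry limit and names here are illustrative, not MSMQ's configuration:

```python
# A message is retried a fixed number of times, then parked in a
# dead-letter queue instead of being lost.
MAX_RETRIES = 3
dead_letter = []

def process_with_retries(message, handler):
    for attempt in range(MAX_RETRIES):
        try:
            return handler(message)
        except Exception:
            continue              # next retry queue in the chain
    dead_letter.append(message)   # no retries left: park it, don't drop it
    return None

def flaky_handler(msg):
    raise RuntimeError("downstream component unavailable")

process_with_retries({"id": 7}, flaky_handler)
assert dead_letter == [{"id": 7}]
```

A healthy handler returns on the first attempt; only messages that exhaust every retry end up in the dead-letter queue, where an operator can inspect and replay them.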

· Distributed Transactions. With all of this talk about application tiers and component architecture, I would be remiss if I left out the need to manage distributed transactions. Luckily, COM+ components have built-in capabilities that leverage the Microsoft Distributed Transaction Coordinator (DTC).
By employing the right network architecture and equipment, combined with some combination of multithreading, message queuing, distributed application processing and loosely coupled events, your application has the potential to scale better and provide the kind of reliability customers expect.
In the remainder of this article, I will give you an overview of a sample application I developed that employs some of these concepts in applied scenarios. Consider this a starting point to tickle your interest in solving some of the scalability and reliability problems I have discussed so far with sound architecture and component design.