Most of websites and enterprise application rely on the database backing them to store the application and customer data. So at some point the database could be the main performance and scalability bottleneck for your system performance, so I ‘m here today to cure this!
Database supporters and resisters:
Database supporters: MySQL, SQL Server, and PostgreSQL:
MySQL, SQL Server, PostgreSQL, and others is hard competitors, everyone have different philosophy, and features.
For example, SQL Server have rich features such as: support UTF16 in the data types, triggers, roles and policies, and integration with .NET, etc. MySQL: Free and open source, many database engines. PostgreSQL: Free and open source, support object-oriented data model.
What about the performance?
The Database performance is important. In the real and trusted performance benchmark: MySQL faster than SQL Server, and PostgreSQL have good performance better than SQL Server, and almost like MySQL. Also PostgreSQL, MySQL have small footprint, and consume small system resource.
Database resisters: HBase, MongoDB, Redis, and others:
Does all application really need databases backing them?
I believe there is no option can fit in all situations, so not all applications need database. The golden wisdom (in my opinion) said: Every challenge has its own solution.
There are many scenario will agreed with Database resisters, such as: Web search engine, Caching, File sharing, etc. In the other hand there are a lot of scenarios agreed with Database supporters, such as: ERP, CRM, E-Commerce, etc.
Tiny URL services for example, shouldn’t use Database at all because it’s require very simple needs map a small/tiny URL to the real/big URL. But what we can use beside or instead of Databases?
There are a lot of tools that fallowing CAP, BASE model, instead of ACID model. CAP model is about:
* Consistency – your data is correct all the time. What you write is what you read.
* Availability – you can read and write and write your data all the time.
* Partition Tolerance – if one or more nodes fails the system still works and becomes consistent when the system comes on-line.
For example in any in-memory or in-disk caching system you will never need all the Database features. You just need CAP like system. Today there are a lot of Documents, and key-value based store systems, such as:
* MongoDB Document oriented
* CouchDB Document oriented (JSON)
* Redis Key-value oriented
* Tokyo Cabinet Key-value oriented
* Memcached in-memory, Key-value oriented
The above list can work in many scenarios, and they provide good performance more than the Databases, and most of them support distributed, partitions, and they also provide good fault tolerance algorithm.
I know after all this there are still people say; Why CAP model or key-value store system, the database is good?
The Database is often a performance- and scalability limiting factor in most types of applications, the reality is that relational databases have not changed much in the last 30 years, and for a lot of purposes using a database is akin to running an interstellar spaceship while having an old Volkswagen engine in the engine room!
So the first lesson today is: Consider use Key-value store system depend on your needs.
Database Optimizing pattern:
What to store into the Database?
lot of people in our industry think about the Database as fundamental, default, primitive disk based store system, but this is wrong there are too many ways to store the data; beside that you shouldn’t store anything into the Database. In other word, don’t store non-relational data in relational data model.
For example if you design a photo sharing website, you shouldn’t store the thumbnails photos into the Database because simply when some user search the brewer will send a lot of requests to view this photos, and making a lot of DB connections is not good, beside this the DBMS will make this process heavy because the DBMS is complex system by nature.
So you should think twice before use the Database to store BLOB data.
Field data types:
In any DBMS you can found many kind of data types, such as: Date, Time, DateTime, Text, BLOB, VarChar, NVarCar (in SQL Server), and so on. If you choice the right data type this will make the database use lower disks space, and the insert/retrieve will run faster. But why I choice think about it? The Database by nature resides in disks and such disks are slow, so if you minimize the size of the stored data this will improve the write and read time. For example don’t use NVarChar data type except if you really need 16 bits Unicode (UTF 16). Also be watchful when you choice Text, NText, BLOB data type because mostly it will store in another location and the row will just hold the a reference to data position, so when you retrieve the data the DBMS will do a lot of seeks.
The primary key and the indexes:
We usually use it to optimize the retrieve, insert, update, and delete operations in the tables. Create a primary key on each table you create and unless you are really knowledgeable enough to figure out a better plan, make it the clustered index.
It’s important to minimize the primary field size and make the inserting always at the end of the table (i.e. make the value of the primary key field always increasing); I suggest use integer and auto increment in the primary key field if it’s possible.
To start with create your indexes; you need to plan an indexing strategy at the same time as you plan and design the database. First, and most important, is the clustered index (Primary key). As a rule, every table in an online transactional processing (OLTP) system should get a clustered index. You can only put one clustered index on a given table, so the proper selection of exactly what column or columns to place it on is extremely important.
It makes sense to leave the cluster on the primary key if that primary key provides the natural access path, the most common field used to either search the table or relate the table to another.
Changing the parameters slightly, if the access path is mostly through another column, says Company Name, this may make a more appropriate clustered index. Another situation is when the table is no longer at the top of the chain, but somewhere in the middle and the most likely avenue of selection for the data is going to come through the foreign key to a parent table. Then the foreign key column becomes a good candidate for the clustered index. A further wrinkle could be added by needing to get the related data in this child table, but based on a date. This would result in a clustered index composed of two columns: the foreign key and the date. As you can see, in a few sentences a number of options were laid out. You need to think this through as you design the system.
You may also identify, either during design, testing or monitoring the production system, that other indexes are needed. While multiple indexes on any given table may be needed, you need to keep in mind that each index adds overhead to the system because these values have to be maintained as the data gets modified, which includes inserts, updates and deletes. Further, since indexes are stored by pointing to the clustered index (one of the many reasons you need a clustered index on the table) changes to the clustered index can result in a cascade through all the other indexes on a table. Because of all this, while indexes are good and necessary, restraint must be exercised in their application & use.
Data retrieve, SP’s, and Ad-hoc queries:
When you submit Ad-hoc query to the DBMS, a number of processes on the server go to work on that query. The purpose of all these processes is to manage the system such that it will provide your data back to you, or store it, in as timely a manner as possible, whilst maintaining the integrity of the data. The processes actually take a time, some time it’s take long time depend on your query complexity.
So I suggest using SP (stored procedure) because most of those steps happen statically at compile time (i.e. creation time). For example you can in .NET – LINQ to SQL to import SPs and call it, and using a few changes you can also enforce the LINQ to SQL to use SPs in any insert, update, and delete operation for better performance.
Minimize the Ad-hoc queries will make you gain some performance but we can’t enforce the developers to forget this important feature. Beside create SP’s for the complex queries and the usually insert, update, delete operations; I suggest monitoring and tuning the Ad-hoc queries execute plans at stabilization and testing stage.
In general try to do the following:
- It’s important to return/select the columns and rows you need only.
- Avid using Like operation because it’s mostly require full table scan.
- Avoid NOT IN, instead use a left outer join – even though it’s often easier to visualize the NOT IN.
- Consider use set nocount on at the top of each SP (and set nocount off) at the bottom.
- Avid Long-Running, and the distributed transactions.
Caching is fundamental thing in reading intensive application, also caching is important in writing intensive application. If your DBMS support caching system, such as Query Caching in MySQL I suggest to turn it on.
Caching the not rapidly changing data such as: Logged-in user’s roles, top most posts, configuration, and Categories into the client/application server memory also is important to minimize the traveling to the server.