Enhancing performance and throughput of database-related applications using a data classification technique based on data access frequency & age of data
Abstract:
Efficiently processing unanticipated queries is a crucial prerequisite for reducing the cost of any generalized database management system. A wide variety of approaches to improving the performance of query evaluation algorithms have been proposed, most of them centered on optimizing the query itself. Query optimization can reduce the overhead of a database system, but something more is needed. Our aim is to develop a framework that categorizes the data of a database into different groups.
Each group is defined by the age of the records in the database. The records of a particular age are placed in a separate file, and each file is a database file; in other words, we split the main database file into different database files. Categorizing the data set in this way can improve data accessibility. In this paper, we propose a data clustering mechanism based on data access frequency. We consider only hot data (the most frequently accessed data) and cold data (the least frequently accessed data).
We divide the whole database into two separate files: the first file contains only hot data and the second file contains only cold data. The time period defining hot and cold data will vary across application domains. The database engine has direct access to the first database file; in case of unavailability of data there, it looks in the second database file. Finally, the experimental results show how and why data access time should outperform other available data clustering approaches.
Introduction:
During the last two decades, the Relational Database Management System (RDBMS) has been established as the technology for data storage. This has given us tremendous capability to work with data of sizes in the terabytes, where 1 terabyte is 1×10^12 bytes (Ms. M.C. 2012).
Nowadays, in this arena of ubiquitous computing where data is stored in huge amounts, modern information technology allows us to produce ever more powerful computers every year, which makes it possible to collect, store, transfer, and combine huge amounts of data at very low cost. As a result, companies and scientific research institutions can afford to build up large archives of documents.
These archives are used for data such as numbers, tables, images, and sounds. However, querying efficiently and exploiting data in an intelligent way from a gigantic database remains a challenge.
The biggest challenge is to represent the data in an efficient and meaningful way. It would be almost impossible to handle or access this large amount of data if database operations were not performed in an optimized way or if the data were not properly clustered. If the data are properly clustered, then the desired data can be treated as similar data, i.e. the same group of data can be used together in further processing.
When the data access routines of an application are not optimized, database performance is dramatically poor almost 90% of the time (Syedur 2010). As a result, the time complexity of the database increases, and organizations waste time modifying and updating the desired data.
So far, research has been done on reducing database overhead related to memory usage and storage management, both of which play a vital role in improving the time complexity of a database, and the problem has already been attacked from many different viewpoints and angles. But the concept of clustering data as hot/warm/cold (frequently accessed, less frequently accessed, and least accessed) has not yet been explored.
From a study of centralized databases and their different application areas, it was found that most data or records are stored but rarely or never accessed after a certain time frame. For example, in a supermarket the sales records are kept in the database, but once the year-end reports are produced, these records are not used for any other purpose. Table 1.1 shows some application areas of this type, where data are not accessed after a certain time frame. The data which are frequently accessed are known as hot data, whereas the opposite are called cold data.
Table 1.1: Data access frequency on hot and cold data by application area.
There are basically three kinds of operations in a database: insert, update and query operations.
The percentage of total accesses on hot data can be calculated by the following formula:

$$O_H = \frac{1}{N}\sum_{i=1}^{N}\left(IO_i + UO_i + QO_i\right)$$

where O_H is the total operation on hot data; IO_i, UO_i and QO_i are the insert, update and query operations on hot data in application area i, respectively; and N is the number of application areas.
Similarly, the total operation on cold data can be calculated by the following formula:

$$O_C = \frac{1}{N}\sum_{i=1}^{N}\left(IO^{c}_{i} + UO^{c}_{i} + QO^{c}_{i}\right)$$

where O_C is the total operation on cold data and the superscript c marks the corresponding operations on cold data. From the above equations, the average data access on hot data is 89.83% and on cold data is 10.17%, so it is absolutely clear that access to cold data is negligible compared to access to hot data.
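As a concrete illustration of this averaging, the following minimal Java sketch computes O_H from per-area percentages; the figures in main are invented for demonstration and are not the survey data behind Table 1.1.

```java
// Minimal illustrative sketch: averaging the share of operations that fall on
// hot data over N application areas, following the formula for O_H above.
public class HotColdShare {

    // Each row holds {insert%, update%, query%} on hot data for one application area.
    static double averageHotShare(double[][] hotOpsPercent) {
        double sum = 0.0;
        for (double[] area : hotOpsPercent) {
            sum += area[0] + area[1] + area[2]; // IO_i + UO_i + QO_i
        }
        return sum / hotOpsPercent.length;      // divide by N application areas
    }

    public static void main(String[] args) {
        double[][] sample = { {45.0, 25.0, 20.0}, {40.0, 30.0, 19.5} }; // two made-up areas
        System.out.printf("O_H = %.2f%%%n", averageHotShare(sample));
    }
}
```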
Based on these observations, in this paper we propose a framework that gives priority to hot data and less priority to cold data. In future research we may also consider the data that reside between hot and cold. The technique we propose is to split the database file into files of different priorities, where the most frequently used records reside in the highest-priority file and the least accessed records reside in the lowest-priority file.
For performance evaluation, only hot data and cold data are considered. The age of the data is determined by the database administrator through the global configuration part of the framework.
For our experiment, we considered hot data to be records accessed within the last six months, and cold data to be records not accessed within the last year. That is, any record accessed within six months is treated as hot data, and if it remains untouched for another six months it becomes cold data. The splitting will not affect the overall data access performance because we keep an index of the data content of each database file.
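A minimal sketch of this age-based classification, assuming each record carries a last-accessed timestamp; the hard-coded thresholds mirror the experimental configuration above, whereas in the framework they come from the global configuration:

```java
import java.time.Duration;
import java.time.Instant;

// Minimal sketch of the age-based hot/cold split, assuming each record
// carries a last-accessed timestamp.
public class AgeClassifier {

    enum DataClass { HOT, INTERMEDIATE, COLD }

    static DataClass classify(Instant lastAccessed, Instant now) {
        Duration age = Duration.between(lastAccessed, now);
        if (age.compareTo(Duration.ofDays(182)) <= 0) {
            return DataClass.HOT;          // accessed within ~six months
        }
        if (age.compareTo(Duration.ofDays(365)) > 0) {
            return DataClass.COLD;         // untouched for more than a year
        }
        return DataClass.INTERMEDIATE;     // between hot and cold (future work)
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        System.out.println(classify(now.minus(Duration.ofDays(30)), now));  // HOT
        System.out.println(classify(now.minus(Duration.ofDays(500)), now)); // COLD
    }
}
```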
We also analyze the performance of a very large database with and without our suggested technique, and show a comparison of our proposed framework with the existing system on SQL Server databases. The above-mentioned concept should improve the overall performance of the proposed algorithm compared to the existing database. Although the technique is demonstrated on SQL Server databases, it could be used on most existing database platforms.
RELATED WORK:
(Rahman, S. et al., 2010) discuss aspects of performance optimization related to data access in transactional databases. They try to prevent possible performance issues before they occur, by applying their knowledge before or during development; their methodology focuses on prevention rather than on recovering after a problem occurs. (Li, B. et al., 2005) proposed optimization principles and methods for Oracle9i database capability in the aspects of database structure, SQL statements, memory allocation, etc. System capability was enhanced to a comparatively great extent by applying these optimization methods during the course of system application.
Their strategy offers some integrated and further conclusions about optimization of the Oracle9i database. It involves a lot of work, however, and requires frequently tracking diverse statistical targets and analyzing the causes of capability changes.
(Van Bommel, P. et al., 1992) focus on optimization of database schema transformation by evolutionary (genetic) search. A framework designed to transform conceptual data schemata into efficient internal schemata was presented. They consider the problem from the viewpoint of searching through the space of all correct, but possibly incomplete, internal representations of the conceptual schema at hand. A search strategy based on evolutionary operators was established.
Query optimization in relational databases has benefited a lot from the simplicity of the data model. This is not the case with the object model. The object-oriented data model is a generalization of the relational one and is believed to eliminate many of its flaws by incorporating modern concepts. Object models are descendants of semantic networks and object programming languages.
They aim to permit the reuse of structures and operations to construct more complex entities (S. Minyar et al., 2013). Data are represented as objects. Associations are implemented by direct ties via object identifiers, which permit fast navigational access between the different objects. Indeed, a query must use the new concepts introduced by the object model (K. Subieta 1995).
The class is an abstract data type permitting the definition of the properties of a set of objects, grouped in two categories: attributes and operations (R.G.G. Cattell, 1995)(G. Gardarin, 1999). The object is a triplet <OID, class, state>, where the OID identifies the object; it is unique and invariant during the lifetime of the program. An attribute is defined by its name and its type. An operation is a function that modifies the state of an object or returns a value.
Inheritance is a mechanism for transmitting the properties of a class to a subclass. Inheritance is single when a property is inherited from only one superclass, and multiple when the property is present in several superclasses. Polymorphism is the ability to define operations having the same name but parameters differing in number or in type.
Most query optimization methods are based on query rewriting (J. Płodzień). Rewriting means transforming a query q1 into a semantically equivalent query q2 promising much better performance. It consists of detecting parts of a query that match some pattern; when a pattern is recognized, the query is rewritten according to the predefined rewriting rule.
The advantage of rewriting is that the algorithms are fast; such optimization is compile-time optimization, performed entirely before the query is executed, hence the optimization process itself does not burden performance. The optimization process is summarized in the three steps shown in Figure 2. The rewrite step consists of a syntactic and semantic rewrite of the query with the goal of determining simpler equivalent queries (J. Płodzień, 2000).
The result of this step is the generation of a query graph. The operation-ordering step takes place in two phases: generation of plans and assessment of the plans determined in the first phase. The execution step chooses the optimal execution plan and executes it.
PROPOSED FRAMEWORK:
This section presents a method of optimizing data organization according to the age of the data in a database. The idea is based on the observation that not all data are used all the time, and unused data unnecessarily add to the cost of a query. Our proposal is to separate the data in a large database according to its age: the most recently accessed data and the least recently accessed data are kept in different database files.
The first database file contains the most recent data and the second contains the least used data; accordingly, the database is split into two separate database files. The working procedure of our proposed framework is simple. We partition the database file into two database files, each containing data according to the access frequency of particular records. The database is partitioned as DB1 and DB2, where DB1 acts as the primary database file. Any record may reside in DB1 or DB2, depending on its use in the main database (DB1).
It is possible for the DB2 database file to act as the primary database when a record is not found in the DB1 database file. The most recently used data (the time frame is configured by the DB admin) reside in the main database (DB1), and according to the data access date, data are transferred to DB2. In our proposed framework (as shown in Figure 3.2), we use multiple processes that have different priority levels and run simultaneously.
Everything is controlled from the application layer. Inside the framework engine we place automatic processes whose job is to analyze recently used data and, according to the frequency of use, separate and send data to one of the two database files while the database engine is idle. Our observation from Table 1.1 shows that the operations on a database file depend on the type of SQL command: insert operations always involve hot data, update operations involve hot data most of the time, and query operations happen on both hot and cold data.
DATA CLASSIFICATION MODEL:
As described above, the framework's automatic processes, running at different priority levels under the control of the application layer, classify data by recency and frequency of use and move records between the two database files while the database engine is idle.
Since the operations on a database depend on the type of SQL command, the proposed framework has components such as an SQL scanner, which determines what type of operation a command is and accordingly acts on one of the two database files. If it is an insert operation, it directly accesses DB1, because 100% of insert operations concern hot data; if it is an update, delete or read operation, it searches DB1 first and, if the record is not found there, it will certainly be in DB2. The other components of the framework are shown in Figure 3.4.
Figure 3.4: Complete flowchart of the Data Classification Model.
STRUCTURE OF PERSISTENCE LAYER:
Database Application: ERP, e-commerce, education management application software, industrial applications, etc.
SQL Command Scanner: As described above, the job of the SQL scanner is to identify which type of SQL command it is handling (an insert, update, delete or read operation). For an insert it hits DB1 directly; for the other operations it searches DB1 first and falls back to DB2.
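A minimal Java sketch of such a scanner is shown below; the class and method names are our own illustration, since only the statement type needs to be recognized:

```java
import java.util.Locale;

// Illustrative sketch of the SQL command scanner; only the statement type has
// to be recognized so the framework knows how to route the command.
public class SqlCommandScanner {

    enum CommandType { INSERT, UPDATE, DELETE, READ, OTHER }

    static CommandType scan(String sql) {
        String head = sql.trim().toLowerCase(Locale.ROOT);
        if (head.startsWith("insert")) return CommandType.INSERT; // always hot: straight to DB1
        if (head.startsWith("update")) return CommandType.UPDATE; // DB1 first, then DB2
        if (head.startsWith("delete")) return CommandType.DELETE; // DB1 first, then DB2
        if (head.startsWith("select")) return CommandType.READ;   // DB1 first, then DB2
        return CommandType.OTHER;
    }

    public static void main(String[] args) {
        System.out.println(scan("INSERT INTO customer VALUES (1, 'x')")); // INSERT
        System.out.println(scan("SELECT name FROM customer"));            // READ
    }
}
```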
Insert Operation: An insert operation directly accesses the primary database (DB1).
Update, Delete, Read Operation: An update, delete or read operation searches DB1 first; if the record is not found there, it will certainly be in DB2.
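The DB1-first lookup can be sketched in Java with JDBC as follows; the connections, table and column names are placeholders, not part of the framework specification:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hedged sketch of the DB1-first lookup: try the hot file, fall back to the
// cold file when the record is absent.
public class TwoFileRouter {
    private final Connection db1; // hot data
    private final Connection db2; // cold data

    TwoFileRouter(Connection db1, Connection db2) {
        this.db1 = db1;
        this.db2 = db2;
    }

    String findCustomerName(long id) throws SQLException {
        String name = lookup(db1, id);
        return (name != null) ? name : lookup(db2, id); // not in DB1 => must be in DB2
    }

    private static String lookup(Connection db, long id) throws SQLException {
        try (PreparedStatement ps = db.prepareStatement(
                "SELECT name FROM customer WHERE id = ?")) {
            ps.setLong(1, id);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString(1) : null;
            }
        }
    }
}
```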
Global Configuration: The global configuration has an interface with the application layer, where the application developer/user has options to configure how the database file is split, the age classification of the data, and so on. This system configuration is also used for database connectivity, clustering, and data synchronization with other systems (a sketch of reading such a configuration follows the list below). The configuration includes:
a) Database Server Name or IP Address
b) Database Platform: MS SQL SERVER, MYSQL, ORACLE, Others
c) Database Name of Hot Data & Cold Data
d) Database User
e) Database Password
f) Database Port (Optional)
g) Application Deployment Date & Time
h) Data Age Time Gap
i) Clustering Mechanism Configuration, etc.
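A hedged sketch of how such a configuration might be read in Java is given below; the property-file format and key names are our own illustration of items (a) through (i):

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

// Sketch of reading the global configuration; the file name and key names
// are illustrative, chosen to mirror items (a)-(i) above.
public class GlobalConfiguration {
    public static void main(String[] args) throws IOException {
        Properties cfg = new Properties();
        try (FileInputStream in = new FileInputStream("framework.properties")) {
            cfg.load(in);
        }
        String server   = cfg.getProperty("db.server");            // (a) server name or IP
        String platform = cfg.getProperty("db.platform");          // (b) MSSQL, MySQL, Oracle, ...
        String hotDb    = cfg.getProperty("db.hot.name");          // (c) hot-data database
        String coldDb   = cfg.getProperty("db.cold.name");         // (c) cold-data database
        String user     = cfg.getProperty("db.user");              // (d)
        String password = cfg.getProperty("db.password");          // (e)
        int port        = Integer.parseInt(cfg.getProperty("db.port", "1433")); // (f) optional
        String deployed = cfg.getProperty("app.deployment");       // (g) deployment date & time
        int ageGapDays  = Integer.parseInt(cfg.getProperty("data.age.gap.days", "182")); // (h)
        String cluster  = cfg.getProperty("clustering.mechanism"); // (i)
        System.out.printf("Hot DB %s on %s:%d, age gap %d days%n",
                hotDb, server, port, ageGapDays);
    }
}
```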
Database Manager: The database manager keeps records of the existing data in the database files. Any data that exceeds the time frame given in the global configuration is shifted from one database file to the other (i.e. DB1 to DB2 or vice versa). Any data or record accessed from DB2 is converted back to hot data, because according to our classification mechanism any data accessed recently is treated as hot data.
Algorithm:
DES: Database Engine Status;
DSPDB: Data Set of Primary DB;
DSSDB: Data Set of Secondary DB;
Tp: Time Period;
DataManager (DES, DSPDB, DSSDB, Tp)
IF DES is idle THEN
BEGIN
    TL1D ← Get Level 1 Data from DSSDB according to age of data & data access frequency
    TL2D ← Get Level 2 Data from DSPDB according to age of data & data access frequency
    Move TL1D to DSPDB
    Move TL2D to DSSDB
END
ELSE
    Wait until Tp;
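The pseudocode can be rendered in Java roughly as follows. This is a sketch under assumptions not fixed by the framework itself: records carry a last_accessed column, both database files live on the same SQL Server instance (so three-part names such as db1.dbo.customer resolve), and idleness of the engine is detected by the caller.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Hedged Java rendering of the DataManager pseudocode above. A production
// version would wrap each move in a transaction and use the platform's own
// bulk-transfer facilities.
public class DataManager {

    static void run(Connection db1, Connection db2, boolean engineIdle,
                    int ageGapDays) throws SQLException {
        if (!engineIdle) {
            return; // wait until the next time period Tp and try again
        }
        String cutoff = "DATEADD(day, -" + ageGapDays + ", GETDATE())";
        try (Statement hot = db1.createStatement();
             Statement cold = db2.createStatement()) {
            // TL1D: recently accessed records found in the cold file move back to DB1
            cold.executeUpdate("INSERT INTO db1.dbo.customer "
                    + "SELECT * FROM customer WHERE last_accessed >= " + cutoff);
            cold.executeUpdate("DELETE FROM customer WHERE last_accessed >= " + cutoff);
            // TL2D: aged records in the hot file move out to DB2
            hot.executeUpdate("INSERT INTO db2.dbo.customer "
                    + "SELECT * FROM customer WHERE last_accessed < " + cutoff);
            hot.executeUpdate("DELETE FROM customer WHERE last_accessed < " + cutoff);
        }
    }
}
```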
DB1 & DB2: DB1 (storing hot data) and DB2 (storing cold data) are relational database systems such as MS SQL Server, MySQL, Oracle and others.
IMPLEMENTATION:
Our proposed model has been implemented in different programming languages (Java and PHP) to improve execution time, reduce system resource usage, and support multiple platforms (Windows, Linux and others). NetBeans IDE 8.2 was used to write the source code of the persistence layer; NetBeans was chosen for its comfortable integrated development environment. NetBeans is developed and maintained by Sun Microsystems and has its own file format for maintaining source code. Oracle, MySQL, SQL Server and MS Access were used as the databases.
Table 4.1.1: Software used for developing the framework (Java & PHP Platform)
Our proposed model requires only modest minimum hardware, though we used much more highly configured computers for our purpose. The hardware used for our experiments is shown in Table 4.1.2.
Table 4.1.2: Hardware used for implementing the framework.
Results & Discussion:
The functionality offered by our proposed model targets Microsoft Windows operating systems. For the implementation of our model we used the Windows XP operating system on a stand-alone machine, and SQL Server 2005 was used as the database server. We tested our framework with different volumes of data.
For example, we configured our system to split the database every three months (this depends on the global configuration), and each month we inserted 1,500,000 (one and a half million) records. Table 4.3 shows the time needed for a single query to execute in the existing database and in our proposed framework. We consider the time needed to execute a query as the cost to the system.
Table 4.3: Query execution time on hot and cold data.
The cost metric is measured in milliseconds. The clear observation is that the proposed algorithm outperforms the currently existing algorithms, and increasingly so as time passes. In some cases our proposed system increases the database's performance by up to 90%. From the above comparison we can clearly see that our framework performs better in all the cases and hence is a strong candidate for adoption.
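For reference, a per-query cost in milliseconds can be taken with a simple wrapper like the following sketch; this is our own illustration of the measurement, and the query passed in would be a placeholder rather than one from the actual workload:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Simple illustrative wrapper for measuring a per-query cost in milliseconds.
public class QueryCost {

    static long costMillis(Connection conn, String sql) throws SQLException {
        long start = System.nanoTime();
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                // drain the result set so fetch time is included in the cost
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```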
Conclusion:
The idea of splitting an entire database is perhaps a new one. Of course, not all databases will benefit from our framework. In the process industry, data change continuously with slight jitter (XU Hui, 2006); append operations are much more frequent than query operations, and update operations occur rarely (LI De-wen 2013). Our framework suits cases where data are stored and, after a certain time, rarely or never accessed again. In such cases it reduces the overhead on the database engine by reducing the volume of data in the primary database, and hence increases the performance of the database. The above experimental results show a clear improvement in the performance of our system.
References:
De-wen, L., Wen-jun, H. and Long, L., 2013, June. Design of real-time database for industry control system based on cloud theory. In 2013 10th IEEE International Conference on Control and Automation (ICCA) (pp. 363-367). IEEE.
Dhande, S.S. and Bamnote, G.R., 2012. Query optimization in object-oriented databases through detecting independent subqueries. International Journal of Advanced Research in Computer Science and Software Engineering (IJARCSSE).
Gardarin, G., 1999. Object and relational databases. Eyrolles.
Hameurlain, A. and Morvan, F., 2009. Evolution of query optimization methods. In Transactions on Large-Scale Data-and Knowledge-Centered Systems I (pp. 211-242). Springer Berlin Heidelberg.
Li, B., Liu, J., Zhu, Y. and Shi, L., 2005. Optimization of database capability in the e-governmental spatial aided decision-making system. In Proceedings of the International Symposium on Spatio-temporal Modeling, Spatial Reasoning, Analysis, Data Mining and Data Fusion, Beijing, China (Vol. 36, p. 2).
Płodzień, J., 2000. Optimization Methods in Object Query Languages. Doctoral dissertation, Institute of Computer Science, Polish Academy of Sciences.
Płodzień, J. and Kraken, A., 2000. Object query optimization through detecting independent subqueries. Information Systems, 25(8), pp.467-490.
Subieta, K., Beeri, C., Matthes, F. and Schmidt, J.W., 1995. A stack-based approach to query languages. In East/West Database Workshop (pp. 159-180). Springer London.
Van Bommel, P. and van der Weide, T.P., 1992, July. Towards database optimization by evolution. In CISMOD (pp. 273-287).
Cattell, R.G., 1994. Object Data Management: Object-Oriented and Extended Relational Database Systems. Addison-Wesley Longman Publishing Co., Inc.
Rahman, M.H., Akter, M.N., Ahmad, R.B. and Rahman, M., 2014, August. Development of a framework to reduce overhead on database engine through data distribution. In Electronic Design (ICED), 2014 2nd International Conference on (pp. 69-72). IEEE.
Sassi, M. and Grissa-Touzi, A., 2005. Contribution to the query optimization in object-oriented databases. Journal of World Academy of Science, Engineering and Technology (WASET), Issue 11.
Rahman, S., Feroz, A.A., Kamruzzaman, M.N.F. and Faruque, M.N., 2010. Analyze database optimization techniques. IJCSNS, 10(8), p.275.