Home >>Distributed DBMS Tutorial >DDBMS - Query Optimization in Distributed Systems

DDBMS - Query Optimization in Distributed Systems

DDBMS Query Optimization in Distributed Systems

Distributed Query Processing Architecture

Processing a query in a distributed database system consists of optimization at both the global and local levels. The query reaches the client or control site of the database system. Here, the user is validated, the query is checked, translated, and globally optimized.

You may represent the architecture as –

Optimization image

Mapping Global Queries into Local Queries

As follows, the process of mapping global queries to local ones can be realized

  • In a global query, the necessary tables have fragments spread across multiple sites. The local databases only have local data knowledge. To obtain information about the distribution, the controlling site uses the global data dictionary and reconstructs the global view from the fragments.
  • The global optimizer runs local queries at the sites where the fragments are processed if there is no replication. The global optimizer selects the site based on communication cost, workload, and server speed if there is replication.
  • A distributed execution plan is created by the global optimizer so that the least amount of data transfer happens across the sites. The plan describes the location of the fragments, the order in which the query steps and the processes involved in the transfer of intermediate results need to be performed.
  • Local queries are optimized by the local servers of the databases. Finally, in the case of horizontal fragments, the local query results are fused together by union operation and joint operation for vertical fragments.

For example, let us consider that according to the area, the following project schema is horizontally fragmented, the cities being New Delhi, Kolkata and Hyderabad.


PId City Department Status

Suppose there is a query for all projects whose status is "Ongoing" to retrieve details.

The global query will be &inus;

$$\sigma_{status} = {\small "ongoing"}^{(PROJECT)}$$

Query in New Delhi’s server will be −

$$\sigma_{status} = {\small "ongoing"}^{({NewD}_-{PROJECT})}$$

Query in Kolkata’s server will be −

$$\sigma_{status} = {\small "ongoing"}^{({Kol}_-{PROJECT})}$$

Query in Hyderabad’s server will be −

$$\sigma_{status} = {\small "ongoing"}^{({Hyd}_-{PROJECT})}$$

In order to get the overall result, we need to union the results of the three queries as follows −

$\sigma_{status} = {\small "ongoing"}^{({NewD}_-{PROJECT})} \cup \sigma_{status} = {\small "ongoing"}^{({kol}_-{PROJECT})} \cup \sigma_{status} = {\small "ongoing"}^{({Hyd}_-{PROJECT})}$

Distributed Query Optimization

Distributed query optimization involves a large number of query trees to be evaluated, each of which produces the necessary query results. This is largely due to the existence of large amounts of repeated and fragmented information. The goal, therefore, is to find an optimal solution rather than the best solution.

The main problems for optimizing distributed queries are −

  • Optimal resource usage in the distributed system.
  • Trading by query.
  • Reduction of the query's solution space.

Optimal Utilization of Resources in the Distributed System

In order to perform the operations pertaining to a query, a distributed system has a variety of database servers on different sites. The methods for efficient use of resources are below.

Operation Shipping: The operation is carried out on the site where the data is processed and not on the site of the client during operation shipping. The findings are then moved to the client site. This is ideal for operations where operands are accessible at the same place. Example: Operations of Select and Project.

Data Shipping-The data fragments are moved to the database server during data shipping, where the operations are performed. This is used in systems in which the operands are distributed at different sites. In systems where the communication costs are minimal, and local processors are much slower than the client server, this is also acceptable.

Hybrid Shipping- This is a combination of shipping of data and activities. Here, data fragments are moved where the procedure runs to the high-speed processors. The findings are then sent to the client site.

Optimization image

Query Trading

The control / client site for a distributed query is called the buyer and the sites where the local queries are performed are called sellers in the query trading algorithm for distributed database systems. A variety of alternatives for selecting sellers and reconstructing the global results are formulated by the buyer. The buyer's aim is to achieve the optimal cost.

The algorithm begins with the buyer assigning the seller sites to sub-queries. From local optimised query plans proposed by the sellers in conjunction with the communication costs for reconstructing the final result, the optimal plan is generated. The query is performed until the global optimal plan is formulated.

Reduction of Solution Space of the Query

Generally, the optimal solution requires reducing the space of the solution such that the cost of query and data transfer is reduced. This, like heuristics in centralized systems, can be done by a set of heuristic rules.

Some of the rules are below:

  • As early as possible, perform selection and projection operations. This reduces the flow of data across communication networks.
  • Simplify horizontal fragment operations by eliminating selection criteria that are not relevant to a specific site.
  • Move fragmented data to the site where most of the data is present and perform operations there in the case of join and union operations consisting of fragments located in multiple sites.
  • Use the semi-join process to qualify the tuples to be joined. This reduces the amount of processing of data , which in turn lowers the cost of communication.
  • In a distributed query tree, merge the common leaves and sub-trees.