This is the first post in a series in which I intend to show you some interesting things and also make you aware of the importance of not just writing SQL, but understanding how SQL is processed.
Here I will try to explain some basic internal details common to every relational database management system (RDBMS), with emphasis on Oracle server architecture.
Databases are at the core of every business, because they are the basis for record keeping. There is a multitude of RDBMS implementations: MySQL, DB2, PostgreSQL, SQL Server, Oracle, Informix, GreenPlum (this one is new to me), etc., each with its own specifics, advantages and, in some cases, limitations.
Most RDBMS, if not all, use a client-server implementation: the client interacts with the user, takes input from the user, and passes it to the server. The server then queries the database to get the data related to the request and sends it back to the client, which then uses the data retrieved from the database. From the implementation point of view, an RDBMS is made of a server and a database: the server, or instance, holds the processes and internal memory structures, while the database holds only the files with data (datafiles).
The ultimate goal of every database, beyond storing data, is to maximize the throughput and speed of delivering data to the client. This is achieved by caching frequently accessed data blocks and internal structures.
The best example of the importance of caching is the database implementation behind Facebook, which uses a large MySQL cluster for data storage and a large deployment of Memcached servers for caching requested data.
In Oracle's implementation of caching, the main memory of the server (the SGA, System Global Area) is used, structured into specific zones that hold:

- copies of data blocks read from datafiles (Buffer Cache)
- information about changes made to the database (Redo Log Buffer)
- runtime information (Shared Pool), which is divided into several areas:
  - the library cache, which contains:
    - shared SQL areas that hold the parse tree and execution plan of a given SQL statement, for reuse
    - private SQL areas (in a shared server configuration) that hold specific information about each SQL run
    - PL/SQL procedures and packages
    - control structures such as locks and library cache handles
  - the dictionary cache, which holds a collection of database tables and views containing reference information about the database, its structures, and its users
  - the result cache, which is composed of the SQL query result cache and the PL/SQL function result cache
  - buffers for parallel execution messages, and control structures
- data needed mainly in I/O-intensive operations (Large Pool)
- information related to all session-specific Java code and data within the JVM (Java Pool)
- structures specific to Oracle Streams (Streams Pool)
In addition to this memory structure, the Oracle server allocates a program global area (PGA) to each server process; it is used for processing SQL statements and for holding logon and other session information.
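On a running instance you can look at these memory structures yourself; a minimal sketch using Oracle's V$SGAINFO and V$PGASTAT dynamic performance views (you need SELECT privilege on them):

```sql
-- Sizes of the individual SGA components (Buffer Cache, Shared Pool, ...)
SELECT name, bytes
FROM   v$sgainfo
ORDER  BY bytes DESC;

-- Aggregate PGA memory currently allocated to server processes
SELECT name, value
FROM   v$pgastat
WHERE  name IN ('total PGA allocated', 'total PGA inuse');
```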
That is why, when accessing a database server through SQL, it is crucial to make the most of these memory structures:

- the Buffer Cache, to hold frequently accessed data through caching
- the Shared Pool, to store frequently used SQL statements in an analyzed and optimized form, so that they can be reused.
The complete processing of a SQL statement goes through several stages: parse, compile, optimize, execute and fetch. When the analyzed form of a statement is already in memory (in the Shared Pool), its processing involves only the execute and fetch stages. This is the case for SQL statements with bind variables and for prepared statements.
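To illustrate the reuse described above, here is a small sketch (the emp table and empno column are placeholders from the classic Oracle demo schema; VARIABLE and EXEC are SQL*Plus commands): the two literal queries produce two distinct entries in the library cache, while the bind-variable form keeps the SQL text identical, so it is parsed once and only executed and fetched afterwards.

```sql
-- Hard parsed twice: each literal makes the SQL text different
SELECT ename FROM emp WHERE empno = 7369;
SELECT ename FROM emp WHERE empno = 7499;

-- Parsed once: the bind variable :empno keeps the text identical,
-- so later runs skip straight to the execute and fetch stages
VARIABLE empno NUMBER
EXEC :empno := 7369;
SELECT ename FROM emp WHERE empno = :empno;
EXEC :empno := 7499;
SELECT ename FROM emp WHERE empno = :empno;
```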
The heart of the Oracle server, when it comes to optimizing a SQL statement, is the CBO (Cost Based Optimizer), which analyzes all the available information about the data in order to find the best execution plan. To achieve this, the optimizer accesses information about access paths to the data, join methods and join order, data statistics, hardware performance details (CPU, I/O, memory, RAID) and internal parameters.
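You can ask the server to show the plan the CBO chose, without actually running the statement; a minimal sketch using the standard EXPLAIN PLAN facility and the DBMS_XPLAN package (again, emp is a placeholder table):

```sql
EXPLAIN PLAN FOR
SELECT ename FROM emp WHERE empno = :empno;

-- Display the chosen plan, including the cost computed by the CBO
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```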
Thus, the optimizer's decisions are influenced globally, mainly by instance parameters such as optimizer_mode. This can be set to optimize for the throughput of all rows (ALL_ROWS), usually used in DWH databases, or for quick retrieval of the first n rows of a query (FIRST_ROWS(n)), usually used in OLTP systems.
This is one of many parameters that can be set at instance level (ALTER SYSTEM), session level (ALTER SESSION) or statement level (using hints). The result of the CBO's activity is the execution plan with the smallest associated cost.
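The three levels look roughly like this (a sketch; setting the parameter at instance level requires ALTER SYSTEM privilege, and emp is a placeholder table):

```sql
-- Instance level: affects all sessions
ALTER SYSTEM SET optimizer_mode = ALL_ROWS;

-- Session level: affects only the current session
ALTER SESSION SET optimizer_mode = FIRST_ROWS_10;

-- Statement level: a hint overrides the mode for one statement only
SELECT /*+ FIRST_ROWS(10) */ ename FROM emp;
```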
Each of the databases I mentioned above has its own implementation of an optimizer engine that is used to choose the best way to execute a SQL statement.
As a concluding rule, when writing a query for a specific RDBMS, try to understand what the server expects to receive and what its basic decision-making rules are, because more often than not the optimizer approaches our queries in a very different way than we do.
In the following posts I will show you, step by step, some of this internal thinking from the server's point of view.