This is the first post in a series in which I intend to show you some interesting things and also make you aware of the importance of not just writing SQL, but understanding how SQL is processed.
Here I will try to explain some basic internal details common to every relational database management system (RDBMS), with emphasis on Oracle server architecture.
Databases are at the core of every business, because they are the basis for record keeping. There is a multitude of RDBMS implementations: MySQL, DB2, PostgreSQL, SQL Server, Oracle, Informix, GreenPlum (this one is new to me), etc., each with its own specifics, advantages and, in some cases, limitations.
Most RDBMS, if not all, use a client-server implementation: the client interacts with the user, takes input from the user, and passes it to the server. The server then queries the database to get the data related to the request and sends it back to the client, which then uses the data retrieved from the database. From the implementation point of view, an RDBMS is made of a server and a database: the server, or instance, holds the processes and internal memory structures, while the database holds only the files with data (datafiles).
The ultimate goal of every database, beyond storing data, is to maximize the throughput and speed of delivering data to the client. This is achieved by caching frequently accessed data blocks and internal structures.
The best example of the importance of caching is the database implementation behind Facebook, which uses a large MySQL cluster for data storage and a large deployment of Memcached servers for caching requested data.
In Oracle's implementation of caching, the main memory of the server (the SGA, System Global Area) is used, structured into specific zones that hold:

- copies of data blocks read from datafiles (Buffer Cache)
- information about changes made to the database (Redo Log Buffer)
- runtime information (Shared Pool), which is divided into several areas:
  - the library cache, which contains:
    - shared SQL areas that hold the parse tree and execution plan of a given SQL statement, for reuse
    - private SQL areas (in a shared server configuration) that hold specific information about each SQL run
    - PL/SQL procedures and packages
    - control structures such as locks and library cache handles
  - the dictionary cache, which holds a collection of database tables and views containing reference information about the database, its structures, and its users
  - the result cache, which is composed of the SQL query result cache and the PL/SQL function result cache
  - buffers for parallel execution messages, and control structures
- data needed mainly in I/O-intensive operations (Large Pool)
- information related to all session-specific Java code and data within the JVM (Java Pool)
- structures specific to Oracle Streams (Streams Pool)
In addition to this memory structure, the Oracle server allocates a program global area (PGA) to each server process; it is used for processing SQL statements and for holding logon and other session information.
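On a running instance you can look at these memory structures yourself; a minimal sketch using Oracle's V$SGAINFO and V$PGASTAT dynamic performance views (you need SELECT privilege on them):

```sql
-- Sizes of the individual SGA components (Buffer Cache, Shared Pool, ...)
SELECT name, bytes
FROM   v$sgainfo
ORDER  BY bytes DESC;

-- Aggregate PGA memory currently allocated to server processes
SELECT name, value
FROM   v$pgastat
WHERE  name IN ('total PGA allocated', 'total PGA inuse');
```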
That is why, when accessing a database server through SQL, it is crucial to make the most of these memory structures:

- the Buffer Cache, to hold frequently accessed data through caching
- the Shared Pool, to store frequently used SQL statements in an analyzed and optimized form, so that they can be reused.
The complete processing of a SQL statement goes through several stages: parse, compile, optimize, execute and fetch. When the analyzed form of a statement is already in memory (in the Shared Pool), its processing involves only the execute and fetch stages. This is the case for SQL statements with bind variables and for prepared statements.
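To illustrate the reuse described above, here is a small sketch (the emp table and empno column are placeholders from the classic Oracle demo schema; VARIABLE and EXEC are SQL*Plus commands): the two literal queries produce two distinct entries in the library cache, while the bind-variable form keeps the SQL text identical, so it is parsed once and only executed and fetched afterwards.

```sql
-- Hard parsed twice: each literal makes the SQL text different
SELECT ename FROM emp WHERE empno = 7369;
SELECT ename FROM emp WHERE empno = 7499;

-- Parsed once: the bind variable :empno keeps the text identical,
-- so later runs skip straight to the execute and fetch stages
VARIABLE empno NUMBER
EXEC :empno := 7369;
SELECT ename FROM emp WHERE empno = :empno;
EXEC :empno := 7499;
SELECT ename FROM emp WHERE empno = :empno;
```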
The heart of the Oracle server, when it comes to optimizing a SQL statement, is the CBO (Cost Based Optimizer), which analyzes all the available information about the data in order to find the best execution plan. To achieve this, the optimizer accesses information about access paths to the data, join methods and join order, data statistics, hardware performance details (CPU, I/O, memory, RAID) and internal parameters.
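You can ask the server to show the plan the CBO chose, without actually running the statement; a minimal sketch using the standard EXPLAIN PLAN facility and the DBMS_XPLAN package (again, emp is a placeholder table):

```sql
EXPLAIN PLAN FOR
SELECT ename FROM emp WHERE empno = :empno;

-- Display the chosen plan, including the cost computed by the CBO
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```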
Thus, the optimizer's decisions are influenced globally, mainly by instance parameters such as optimizer_mode. This can be set to optimize for the throughput of all rows (ALL_ROWS), usually used in DWH databases, or for quick retrieval of the first n rows of a query (FIRST_ROWS(n)), usually used in OLTP systems.
This is one of many parameters that can be set at instance level (ALTER SYSTEM), session level (ALTER SESSION) or statement level (using hints). The result of the CBO's activity is the execution plan with the smallest associated cost.
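The three levels look roughly like this (a sketch; setting the parameter at instance level requires ALTER SYSTEM privilege, and emp is a placeholder table):

```sql
-- Instance level: affects all sessions
ALTER SYSTEM SET optimizer_mode = ALL_ROWS;

-- Session level: affects only the current session
ALTER SESSION SET optimizer_mode = FIRST_ROWS_10;

-- Statement level: a hint overrides the mode for one statement only
SELECT /*+ FIRST_ROWS(10) */ ename FROM emp;
```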
Each of the databases I mentioned above has its own implementation of an optimizer engine that is used to choose the best way to execute a SQL statement.
As a concluding rule, when writing a query for a specific RDBMS, try to understand what the server expects to receive and what its basic decision-making rules are, because more often than not the optimizer approaches our queries in a very different way than we do.
In the following posts I will show you, step by step, some of this internal thinking from the server's point of view.