Saturday, July 15, 2017

Exadata 12c PX Adaptive Offloading

Here is yet another case where you may not see as much offloading on your Exadata as you expect.

I was recently investigating a simple select count(*) query producing a lot of buffered read events:
select count(*) from whs.trans partition (sys_p10236650) f

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  resmgr:pq queued                                1        0.00          0.00
  Disk file operations I/O                        1        0.00          0.00
  cell single block physical read                41        0.00          0.04
  enq: KO - fast object checkpoint                3        0.00          0.00
  reliable message                                1        0.00          0.00
  enq: PS - contention                           24        0.00          0.00
  PX Deq: Join ACK                               48        0.00          0.00
  PX Deq: Parse Reply                            24        0.01          0.06
  SQL*Net message to client                       2        0.00          0.00
  SQL*Net message from client                     2        0.00          0.00
  PX Deq: Execute Reply                         338        0.08          1.57
  PX Deq: Signal ACK EXT                         24        0.01          0.01
  PX Deq: Slave Session Stats                    24        0.00          0.00
  cursor: pin S wait on X                        23        0.01          0.18
  PX Deq: Execution Msg                         362        0.25          5.12
  cell multiblock physical read                3402        0.03          9.07
  gc current grant busy                           5        0.00          0.00
  latch: gc element                              60        0.00          0.00
  gc cr grant 2-way                             136        0.00          0.03
  gc cr multi block request                     131        0.00          0.06
  cell smart table scan                        7714        0.02          2.27
  library cache: mutex X                         15        0.00          0.00
  library cache pin                               3        0.00          0.00
  latch: cache buffers lru chain                 13        0.00          0.00
  latch: ges resource hash list                   1        0.00          0.00
  gc current block 2-way                          1        0.00          0.00
  cell list of blocks physical read               2        0.00          0.01
  gc cr grant congested                           2        0.00          0.00
  latch: object queue header operation            2        0.00          0.00
  latch: gcs resource hash                        3        0.00          0.00
********************************************************************************
As you can see, we've got 3402 cell multiblock physical read and 41 cell single block physical read events alongside 7714 cell smart table scan events. I looked at all the usual smart scan-preventing suspects but came up with nothing.

Tracing

In order to further understand the source of buffered reads I've enabled the following traces:
alter session set events 'trace[nsmtio]';
alter session set "_px_trace" = high,granule,high,execution;
What I discovered in the trace was quite interesting. For smart scans the following sections were present:
2017-07-15 13:54:17.777305*:PX_Control:kxfx.c@10122:kxfxsGetNextGranule():  Receiving 1 granules
...
kxfxContAdaptOff: offload granule:1 offload enabled:1 ratio:0.800000 cellrate:0.000000 cacherate:0.000000 dr_amount:0 dr_time:0 bc_amount:0 bc_time:0 system offloading ratio:0.800000
...
NSMTIO: kxfxghwm:[DirectRead]: Buffer Cache Consumption rate of granule is less and direct read enabled.
...
WAIT #140442514780840: nam='cell smart table scan' ela= 208 cellhash#=459971463 p2=0 p3=0 obj#=36324502 tim=13309302766869
WAIT #140442514780840: nam='cell smart table scan' ela= 224 cellhash#=83369134 p2=0 p3=0 obj#=36324502 tim=13309302767638
WAIT #140442514780840: nam='cell smart table scan' ela= 157 cellhash#=3883045144 p2=0 p3=0 obj#=36324502 tim=13309302768329
...
And for buffered reads the trace had the following:
2017-07-15 13:54:18.151520*:PX_Control:kxfx.c@10122:kxfxsGetNextGranule():  Receiving 1 granules
...
kxfxContAdaptOff: offload granule:0 offload enabled:1 ratio:0.800000 cellrate:1320.421053 cacherate:0.000000 dr_amount:0 dr_time:0 bc_amount:0 bc_time:0 system offloading ratio:0.800000
...
NSMTIO: kxfxghwm:[GRANULE_AFFINITIZED]: Either size is small OR medium and can be cached.
...
WAIT #140442514780840: nam='cell multiblock physical read' ela= 2057 cellhash#=2689352169 diskhash#=3123344858 bytes=1048576 obj#=36324502 tim=13309303143810
WAIT #140442514780840: nam='cell multiblock physical read' ela= 3162 cellhash#=2689352169 diskhash#=3123344858 bytes=1048576 obj#=36324502 tim=13309303151338
WAIT #140442514780840: nam='cell multiblock physical read' ela= 1383 cellhash#=2689352169 diskhash#=3123344858 bytes=450560 obj#=36324502 tim=13309303156489
...
The above sections alternated for each granule received: after receiving a granule, the entire granule was processed using either smart scans or buffered reads. It also appeared that the decision was driven by the kxfxContAdaptOff (Control Adaptive Offloading?) function. Notice that kxfxContAdaptOff also emits a cellrate statistic into the trace. This got me thinking that it might be checking the storage cell workload (or something similar) and using it as one of the inputs when deciding which way to process a granule.
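As a cross-check, the session statistics give a quick way to see how much of a scan was actually offloaded. A minimal sketch (the statistic names below are standard Exadata statistics; :sid stands for whichever QC or PX slave session you want to inspect):
select sn.name, st.value
  from v$statname sn, v$sesstat st
 where sn.statistic# = st.statistic#
   and st.sid = :sid
   and sn.name in ('cell physical IO bytes eligible for predicate offload',
                   'cell physical IO interconnect bytes returned by smart scan',
                   'physical read total bytes');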

PX Adaptive Offloading

With a little bit of outside help the kxfxContAdaptOff function led us to two underscore parameters:
SQL> select ksppinm, ksppstvl, ksppstdfl, ksppdesc
  2   from x$ksppi x, x$ksppcv y
  3   where (x.indx = y.indx)
  4    and ksppinm like '_px_adaptive_offload%';
KSPPINM                         KSPPSTVL KSPPSTDFL KSPPDESC
------------------------------- -------- --------- --------------------------------------------------------------------------------
_px_adaptive_offload_threshold  10       10        threshold (GB/s) for PQ adaptive offloading of granules
_px_adaptive_offload_percentage 80       30        percentage for PQ adaptive offloading of granules
The description of these parameters appeared to match the behavior I was seeing. It appeared that there was a threshold and a percentage controlling how many granules were subjected to adaptive offloading.
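For the tests that follow I simply changed the threshold before re-running the query. A minimal sketch of the statements involved (these are unsupported underscore parameters, so treat this strictly as a test-system exercise; depending on the parameter scope, alter system may be required instead of alter session):
-- test-system only: unsupported underscore parameters
alter session set "_px_adaptive_offload_threshold"=0;    -- or 1000, etc.
alter session set "_px_adaptive_offload_percentage"=80;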

_px_adaptive_offload_threshold

I have repeated my test using _px_adaptive_offload_threshold=0 and here are the events I've got:
Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  library cache pin                               2        0.00          0.00
  resmgr:pq queued                                1        0.00          0.00
  enq: KO - fast object checkpoint                3        0.00          0.00
  reliable message                                1        0.00          0.00
  enq: PS - contention                           39        0.00          0.00
  PX Deq: Join ACK                               48        0.00          0.00
  PX Deq: Parse Reply                            24        0.01          0.02
  Disk file operations I/O                        1        0.00          0.00
  SQL*Net message to client                       2        0.00          0.00
  SQL*Net message from client                     2        0.00          0.00
  PX Deq: Execute Reply                         338        0.23          4.46
  PX Deq: Signal ACK EXT                         24        0.06          0.07
  PX Deq: Slave Session Stats                    24        0.00          0.00
  library cache lock                              1        0.00          0.00
  PX Deq: Execution Msg                         362        0.47         14.54
  cell multiblock physical read               14536        0.04         44.74
  gc current grant busy                          24        0.00          0.01
  gc cr grant 2-way                             626        0.00          0.11
  gc cr multi block request                     553        0.00          0.18
  cell single block physical read              3459        0.02          2.69
  latch: gcs resource hash                       29        0.00          0.01
  latch: gc element                             149        0.00          0.02
  latch: cache buffers lru chain                 42        0.00          0.01
  latch: cache buffers chains                     4        0.00          0.00
  latch: object queue header operation            3        0.00          0.00
  gc cr grant congested                           1        0.00          0.00
********************************************************************************
Notice the complete absence of smart scans -- everything got processed using buffered reads! This confirmed that I was on the right track. And here are the events with _px_adaptive_offload_threshold=1000:
Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  resmgr:pq queued                                1        0.00          0.00
  enq: KO - fast object checkpoint                3        0.02          0.02
  reliable message                                1        0.00          0.00
  enq: PS - contention                           29        0.00          0.00
  PX Deq: Join ACK                               48        0.00          0.00
  PX Deq: Parse Reply                            24        0.01          0.03
  Disk file operations I/O                        1        0.00          0.00
  SQL*Net message to client                       2        0.00          0.00
  SQL*Net message from client                     2        0.00          0.00
  PX Deq: Execute Reply                         338        0.09          1.37
  PX Deq: Signal ACK EXT                         24        0.08          0.09
  PX Deq: Slave Session Stats                    24        0.00          0.00
  PX Deq: Execution Msg                         362        0.45         16.56
  cell multiblock physical read                2018        0.10          7.04
  gc current grant busy                           5        0.00          0.00
  gc cr grant 2-way                              57        0.00          0.00
  gc cr multi block request                      41        0.00          0.01
  cell smart table scan                        8641        0.10          5.23
  cell single block physical read               602        0.02          0.58
  library cache pin                               2        0.00          0.00
  latch: cache buffers lru chain                  2        0.00          0.00
  latch: gc element                               4        0.00          0.00
  latch: object queue header operation            1        0.00          0.00
  library cache: mutex X                          1        0.00          0.00
********************************************************************************
Notice that while I've got more smart scans, the buffered reads were still there as well. Interestingly enough, I could not fully eliminate the buffered reads no matter how high I set this parameter. This likely means that there are more variables at play here and/or that _px_adaptive_offload_percentage dictates that a certain portion of granules always gets processed using buffered reads.

I'm not entirely sure what _px_adaptive_offload_percentage does at the moment, as setting it to one value or another did not appear to make any difference. While this is a guess on my part, the value in effect (80) matches the ratio:0.800000 reported by kxfxContAdaptOff, although the ratio could also simply be hard coded in the function.

Final thoughts

Clearly 12c introduced some new code which can dynamically decide how to process each PX granule. The unfortunate part is that none of this appears to be documented, and it looks like quite a big change which warrants some sort of explanation.

Updated a little bit later...

It appears that setting _px_adaptive_offload_threshold=0 disables Adaptive Offloading after all. The reason I got only buffered reads when I set it to 0 was that the system was running with Auto DOP, so In-Memory PQ kicked in. When set together with _parallel_cluster_cache_policy=adaptive, everything reverted to the expected behavior -- every granule got processed using smart scans.
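For reference, a sketch of the combination described above (same caveats as before about unsupported underscore parameters):
alter session set "_px_adaptive_offload_threshold"=0;         -- disable PX adaptive offloading
alter session set "_parallel_cluster_cache_policy"=adaptive;  -- take In-Memory PQ out of the picture
select count(*) from whs.trans partition (sys_p10236650);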

Friday, July 07, 2017

12.2 ACFS compression, part I

One of the new 12.2 features is the ability to transparently compress ACFS filesystems.

Compression is enabled using acfsutil utility:
[root@raca1 ~]# acfsutil -h compress on
Usage: acfsutil [-h] compress on [-a <algorithm>] <mount point>
                [-a <algorithm>]      - Set default compression algorithm
                                        Currently only 'lzo' available
                <mount point>         - Enable compression on volume
Clearly there is support for more compression algorithms to be added in the future but right now only lzo is supported.

Let's go ahead and enable compression on the ACFS filesystem I have:
[root@raca1 ~]# acfsutil compress on /u02/oradata
Compression status can be checked using acfsutil info fs:
[root@raca1 ~]# acfsutil info fs /u02/oradata
/u02/oradata
    ACFS Version: 12.2.0.1.0
    on-disk version:       46.0
    compatible.advm:       12.2.0.0.0
    ACFS compatibility:    12.2.0.0.0
    flags:        MountPoint,Available
    mount time:   Fri Jul  7 12:53:39 2017
    mount sequence number: 0
    allocation unit:       4096
    metadata block size:   4096
    volumes:      1
    total size:   8589934592  (   8.00 GB )
    total free:   6935588864  (   6.46 GB )
    file entry table allocation: 393216
    primary volume: /dev/asm/data-95
        label:
        state:                 Available
        major, minor:          250, 48641
        logical sector size:   4096
        size:                  8589934592  (   8.00 GB )
        free:                  6935588864  (   6.46 GB )
        metadata read I/O count:         23833
        metadata write I/O count:        39103
        total metadata bytes read:       257896448  ( 245.95 MB )
        total metadata bytes written:    421969920  ( 402.42 MB )
        ADVM diskgroup:        DATA
        ADVM resize increment: 67108864
        ADVM redundancy:       unprotected
        ADVM stripe columns:   8
        ADVM stripe width:     1048576
    number of snapshots:  0
    snapshot space usage: 0  ( 0.00 )
    replication status: DISABLED
    compression status: ENABLED
As a quick test I've created a new empty database inside the ACFS filesystem. We can now check the compressed file sizes:
[root@raca1 ora12cr2]# acfsutil compress info /u02/oradata/ora12cr2/sysaux01.dbf
Compression Unit size: 8192
Disk storage used:   (  88.70 MB )
Disk storage saved:  ( 461.31 MB )
Storage used is 16% of what the uncompressed file would use.
File is not scheduled for asynchronous compression.
[root@raca1 ora12cr2]# ls -l /u02/oradata/ora12cr2/sysaux01.dbf
-rw-r----- 1 oracle dba 576724992 Jul  7 13:45 /u02/oradata/ora12cr2/sysaux01.dbf
It is worth noting that not all file types are compressed. For example, redo logs are left uncompressed:
[root@raca1 ora12cr2]# acfsutil compress info /u02/oradata/ora12cr2/redo01.log
The file /u02/oradata/ora12cr2/redo01.log is not compressed.
So far so good -- the feature indeed allows you to save some space by compressing the files. I will continue exploring this capability in the next part.

Monday, May 22, 2017

Oracle SuperCluster M7 SLOB LIO Tests vs Intel Xeon E5-2699 V4

Here are some SLOB LIO figures from a DB zone configured with 16 threads running on Oracle SuperCluster M7 hardware. For comparison I've also included numbers from an Intel Xeon E5-2699 V4 CPU.

It makes sense to mention that this is not exactly a fair comparison: a single SPARC M7 core has 8 threads associated with it, so my zone is able to utilize a total of two SPARC M7 cores (16 threads total at 8 threads per core). The E5-2699 V4 is currently Intel's top-of-the-line model when it comes to core count, packing 22 cores. So we're comparing two SPARC M7 cores against 16 E5-2699 V4 cores. It does, however, help answer the question: if you're running on a certain number of Intel cores, what kind of performance can you expect when you move your workload "as is" over to a heavily threaded M7?

Below are the results:



The first thing worth mentioning is that M7 still exhibits a large deficit when it comes to single threaded performance -- a single SPARC M7 thread delivers about 60% of the performance of an E5-2699 V4 core (which is not even a top-bin CPU when it comes to frequency). Throughput is a different story -- two M7 cores are almost able to match four E5-2699 V4 cores thanks to the heavily threaded design.

Thursday, July 23, 2015

ORA-15410: Disks in disk group do not have equal size

12.1.0.2 ASM introduced a new feature which might prevent you from adding dissimilar size disks into normal or high redundancy disk groups. The relevant MOS note is Doc ID 1938950.1.

Unfortunately I've found that some information in this note requires further clarification.

COMPATIBLE.ASM

The note suggests that the check is only enforced when the Disk Group has the 'COMPATIBLE.ASM'='12.1.0.2' attribute set. This is incorrect and can be easily demonstrated:
SQL> select name, compatibility from v$asm_diskgroup;

NAME                           COMPATIBILITY
------------------------------ --------------------
DATA                           11.2.0.3.0

SQL> alter diskgroup data add disk '/dev/sdc';
alter diskgroup data add disk '/dev/sdc'
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15410: Disks in disk group DATA do not have equal size.
As you can see my Disk Group has 'COMPATIBLE.ASM'='11.2.0.3.0' which did not prevent it from getting ORA-15410.

What happens if you have disks of dissimilar sizes already in the Disk Group?

Let's say you've upgraded from a previous release and your normal/high redundancy disk group had disks of dissimilar sizes in it already. What happens when you try to add a disk after the upgrade to 12.1.0.2 ASM? Some of the questions that immediately jump to mind:
  • Will it always raise an error because the existing disks have different sizes, forcing you to make everything the same size first?
  • Will it let you add a disk as long as its size matches one of the disks already in the Disk Group?
  • Will it let you add a disk with a new size?
Here I have a 12.1.0.2 normal redundancy Disk Group with two disks of dissimilar size:
SQL> select dg.name group_name,
    dg.type,
    dg.compatibility,
    d.name disk_name,
    d.total_mb
  from v$asm_diskgroup dg, v$asm_disk d
  where dg.group_number=d.group_number
  order by d.disk_number;

GROUP_NAME TYPE   COMPATIBILITY        DISK_NAME    TOTAL_MB
---------- ------ -------------------- ---------- ----------
DATA       NORMAL 12.1.0.2.0           DATA_0000        4096
DATA       NORMAL 12.1.0.2.0           DATA_0001        5120
Let's add a third disk of some other size:
SQL> alter diskgroup data add disk '/dev/sdd' size 2g;

Diskgroup altered.

SQL> select dg.name group_name,
    dg.type,
    dg.compatibility,
    d.name disk_name,
    d.total_mb
  from v$asm_diskgroup dg, v$asm_disk d
  where dg.group_number=d.group_number
  order by d.disk_number;

GROUP_NAME TYPE   COMPATIBILITY        DISK_NAME    TOTAL_MB
---------- ------ -------------------- ---------- ----------
DATA       NORMAL 12.1.0.2.0           DATA_0000        4096
DATA       NORMAL 12.1.0.2.0           DATA_0001        5120
DATA       NORMAL 12.1.0.2.0           DATA_0002        2048
As you can see, as long as you have disks of dissimilar sizes already in the disk group, the check appears to be ignored. I can resize the disks as well:
SQL> alter diskgroup data resize disk DATA_0002 size 4g;

Diskgroup altered.

SQL> select dg.name group_name,
    dg.type,
    dg.compatibility,
    d.name disk_name,
    d.total_mb
  from v$asm_diskgroup dg, v$asm_disk d
  where dg.group_number=d.group_number
  order by d.disk_number;  2    3    4    5    6    7    8

GROUP_NAME TYPE   COMPATIBILITY        DISK_NAME    TOTAL_MB
---------- ------ -------------------- ---------- ----------
DATA       NORMAL 12.1.0.2.0           DATA_0000        4096
DATA       NORMAL 12.1.0.2.0           DATA_0001        5120
DATA       NORMAL 12.1.0.2.0           DATA_0002        4096
Now what happens if I drop DATA_0001, thus making all disks equal in size?
SQL> alter diskgroup data drop disk DATA_0001;

Diskgroup altered.

SQL> alter diskgroup data add disk '/dev/sdc';
alter diskgroup data add disk '/dev/sdc'
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15410: Disks in disk group DATA do not have equal size.
It appears that the moment all disks in the Disk Group became the same size, the check was enforced.

Conclusion

Based on the tests I've demonstrated above, the rules for ORA-15410 on 12.1.0.2 ASM appear to be:
  • If all the disks in a Disk Group are of the same size then the check is enforced, regardless of what 'COMPATIBLE.ASM' value is.
  • If a Disk Group contains disks of different sizes then the check is ignored, regardless of what 'COMPATIBLE.ASM' value is.
(keeping in mind that the check can only happen for normal/high redundancy Disk Groups)
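So before adding a disk to a normal or high redundancy Disk Group on 12.1.0.2 it is worth checking whether the existing disks are already of equal size. A simple query along these lines will do:
select group_number, disk_number, name, total_mb
  from v$asm_disk
 where group_number <> 0
 order by group_number, disk_number;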

Wednesday, July 15, 2015

Wrong Results

It is interesting how a combination of technologies in Oracle can interact in a way which produces seemingly bizarre outcomes.

Consider the following query:
SQL> with v as
(
        select 20 n from dual
) select distinct v.n
                from v, t
                where v.n=t.n(+);

no rows selected
Note that I'm doing a left outer join; however, the query somehow managed to lose the single row I have in the subquery factoring clause (and just in case you're wondering, I've used subquery factoring for simplicity; replacing it with a real table makes no difference).

The first reaction is how could something as simple as this go so terribly wrong?

Let's take a look at the plan:
SQL> with v as
(
        select 20 n from dual
) select distinct v.n
                from v, t
                where v.n=t.n(+);  2    3    4    5    6

Execution Plan
----------------------------------------------------------
Plan hash value: 627307704

----------------------------------------------------------------------------------------
| Id  | Operation                       | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |      |     1 |     9 |     6  (17)| 00:00:01 |
|   1 |  HASH UNIQUE                    |      |     1 |     9 |     6  (17)| 00:00:01 |
|*  2 |   FILTER                        |      |       |       |            |          |
|*  3 |    HASH JOIN OUTER              |      |     1 |     9 |     5   (0)| 00:00:01 |
|   4 |     VIEW                        |      |     1 |     3 |     2   (0)| 00:00:01 |
|   5 |      FAST DUAL                  |      |     1 |       |     2   (0)| 00:00:01 |
|*  6 |     MAT_VIEW REWRITE ACCESS FULL| MV_T |    10 |    60 |     3   (0)| 00:00:01 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("MV_T"."GID"=0)
   3 - access("V"."N"="MV_T"."N"(+))
   6 - filter("MV_T"."N"(+) IS NOT NULL)
As it turns out query rewrite is in play, but the real oddity lies in the Predicate Information section. Line #2 reads filter("MV_T"."GID"=0), which took me by surprise (where did it come from?) and also explained why the row went missing. A predicate like that essentially turns our query into an inner join!

The answer lies in how this materialized view was created and the cool trick the optimizer tried to do which didn't quite work out. Indeed, I have created the table and the materialized view in the following way:
SQL> create table t as
  2   select mod(level, 10) n, level m
  3    from dual
  4    connect by level <= 1000;
Table created

SQL> create materialized view mv_t enable query rewrite as
  2   select n, sum(m), grouping_id(n) gid
  3    from t
  4    group by rollup(n);
Materialized view created
The materialized view has two levels of aggregation due to the use of a rollup. I'm also using the grouping_id function to identify each grouping level. So what the optimizer tried to do was apply the GID = 0 predicate in order to eliminate the rollup rows but, unfortunately, this also turned our query into an inner join equivalent (final query taken from the optimizer trace):
SELECT DISTINCT "V"."N" "N"
 FROM (SELECT 20 "N" FROM "SYS"."DUAL" "DUAL") "V", "ROOT"."MV_T" "MV_T"
 WHERE "MV_T"."GID" = 0
  AND "V"."N" = "MV_T"."N"(+)
Of course the correct predicate in this case should be "MV_T"."GID" (+) = 0.

I have tried the above test case on both 11.2.0.3 and 12.1.0.2.0 with both versions producing the same wrong results.
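Until the underlying issue is fixed, one way to confirm that the rewrite is the culprit (and to work around it) is to keep the materialized view out of the plan, for example with the no_rewrite hint or by disabling rewrite at the session level. A sketch:
select /*+ no_rewrite */ distinct v.n
  from (select 20 n from dual) v, t
 where v.n = t.n(+);

-- or, for the whole session:
alter session set query_rewrite_enabled = false;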

Saturday, March 07, 2015

Converting non-CDB database to a PDB when TDE is in use

Converting a non-CDB database to a PDB is a rather straightforward process. However, once TDE (Transparent Data Encryption) gets involved certain things become less obvious, so I've decided to write a small guide on how to accomplish this. In order for a non-CDB database to be converted to a PDB its version needs to be at least 12.1.0.2.0.

Encryption Wallet Location

My encryption wallet location is set to the following (sqlnet.ora):
ENCRYPTION_WALLET_LOCATION=
  (SOURCE =
    (METHOD = FILE)
    (METHOD_DATA =
      (DIRECTORY = /u01/app/oracle/admin/$ORACLE_SID/wallet/tde)
    )
  )
Create a non-CDB database

Let's start by creating a non-CDB database. I will call this database db1 and it will later be converted to a pluggable database called pdb1:
dbca -silent \
-createDatabase \
-templateName New_Database.dbt \
-gdbName db1 \
-createAsContainerDatabase false \
-sysPassword oracle \
-systemPassword oracle \
-emConfiguration none \
-datafileDestination /u02/oradata \
-redoLogFileSize 128 \
-recoveryAreaDestination /u02/fra \
-storageType FS \
-characterSet al32utf8 \
-nationalCharacterSet al16utf16 \
-automaticMemoryManagement false \
-initParams filesystemio_options=setall \
-initParams session_cached_cursors=100 \
-totalMemory 1024
Creating and starting Oracle instance
1% complete
...
100% complete
Since I have PSU2 applied I need to run datapatch once the database has been created:
[oracle@ora12cr1 ~]$ . oraenv
ORACLE_SID = [oracle] ? db1
The Oracle base remains unchanged with value /u01/app/oracle
[oracle@ora12cr1 ~]$ $ORACLE_HOME/OPatch/datapatch -verbose
Create the wallet and encrypted table

We're now ready to set up the wallet and create a user with an encrypted table:
[oracle@ora12cr1 ~]$ mkdir -p /u01/app/oracle/admin/db1/wallet/tde
[oracle@ora12cr1 ~]$ sqlplus / as sysdba

SQL*Plus: Release 12.1.0.2.0 Production on Sat Mar 7 15:51:21 2015

Copyright (c) 1982, 2014, Oracle.  All rights reserved.


Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> alter system set encryption key identified by "db1";

System altered.

SQL> alter system set db_create_file_dest='/u02/oradata';

System altered.

SQL> create tablespace encrypted datafile size 64m encryption using 'AES256' default storage (encrypt);

Tablespace created.

SQL> create user encrypt_user identified by "encrypt_user" default tablespace encrypted;

User created.

SQL> alter user encrypt_user quota unlimited on encrypted;

User altered.

SQL> create table encrypt_user.z_encrypted as
        select dbms_random.string('x', 100) s
                from dual
                connect by level <= 100;  2    3    4

Table created.
Note that I deliberately did not use the new administer key management syntax and instead created the wallet the old-style way, as I expect most existing environments to be set up like this. We will explore the significance of this in a moment.

Create a CDB

Before our non-CDB database can be converted to a PDB and plugged into a CDB we need to create the container database first:
[oracle@ora12cr1 ~]$ dbca -silent \
> -createDatabase \
> -templateName New_Database.dbt \
> -gdbName cdb12cr1 \
> -createAsContainerDatabase true \
> -sysPassword oracle \
> -systemPassword oracle \
> -emConfiguration none \
> -datafileDestination /u02/oradata \
> -redoLogFileSize 128 \
> -recoveryAreaDestination /u02/fra \
> -storageType FS \
> -characterSet al32utf8 \
> -nationalCharacterSet al16utf16 \
> -automaticMemoryManagement false \
> -initParams filesystemio_options=setall \
> -initParams session_cached_cursors=100 \
> -totalMemory 1024
Creating and starting Oracle instance
1% complete
...
100% complete
As before I need to run the datapatch utility:
[oracle@ora12cr1 ~]$ . oraenv
ORACLE_SID = [oracle] ? cdb12cr1
The Oracle base remains unchanged with value /u01/app/oracle
[oracle@ora12cr1 ~]$ $ORACLE_HOME/OPatch/datapatch -verbose

Setup CDB with a wallet

Our newly created CDB needs to be set up with a wallet before we proceed with plugging in a PDB which utilizes TDE. Let's do it now:
[oracle@ora12cr1 ~]$ mkdir -p /u01/app/oracle/admin/cdb12cr1/wallet/tde
[oracle@ora12cr1 ~]$ sqlplus / as sysdba

SQL*Plus: Release 12.1.0.2.0 Production on Sat Mar 7 17:23:58 2015

Copyright (c) 1982, 2014, Oracle.  All rights reserved.


Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> administer key management
        create keystore '/u01/app/oracle/admin/cdb12cr1/wallet/tde'
        identified by "cdb12cr1";  2    3

keystore altered.

SQL> administer key management
        set keystore open
        identified by "cdb12cr1";  2    3

keystore altered.

SQL> administer key management
        set key identified by "cdb12cr1"
        with backup;  2    3

keystore altered.

Here I have set up the wallet and the master encryption key using the new syntax.

Wallets

Before we move forward let's explore the differences between db1 (created using the old syntax) and cdb12cr1 (created using the new syntax) wallets:
[oracle@ora12cr1 ~]$ orapki wallet display -wallet /u01/app/oracle/admin/db1/wallet/tde
Oracle PKI Tool : Version 12.1.0.2
Copyright (c) 2004, 2014, Oracle and/or its affiliates. All rights reserved.

Enter wallet password:
Requested Certificates:
Subject:        CN=oracle
User Certificates:
Oracle Secret Store entries:
ORACLE.SECURITY.DB.ENCRYPTION.AY/sdMkWZk/4v3YgBKzZtcIAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
ORACLE.SECURITY.DB.ENCRYPTION.MASTERKEY
ORACLE.SECURITY.ID.ENCRYPTION.
ORACLE.SECURITY.KB.ENCRYPTION.
Trusted Certificates:
[oracle@ora12cr1 ~]$  orapki wallet display -wallet /u01/app/oracle/admin/cdb12cr1/wallet/tde
Oracle PKI Tool : Version 12.1.0.2
Copyright (c) 2004, 2014, Oracle and/or its affiliates. All rights reserved.

Enter wallet password:
Requested Certificates:
Subject:        CN=oracle
User Certificates:
Oracle Secret Store entries:
ORACLE.SECURITY.DB.ENCRYPTION.Ad+A607CLU+Ivx4f4E7KUYIAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
ORACLE.SECURITY.DB.ENCRYPTION.MASTERKEY
ORACLE.SECURITY.ID.ENCRYPTION.
ORACLE.SECURITY.KB.ENCRYPTION.
ORACLE.SECURITY.KM.ENCRYPTION.Ad+A607CLU+Ivx4f4E7KUYIAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Trusted Certificates:
Notice how db1 has the encryption key listed only under ORACLE.SECURITY.DB.ENCRYPTION, while cdb12cr1 also has the encryption key listed under the new ORACLE.SECURITY.KM.ENCRYPTION entry.

In practice what I found is that unless your encryption key is listed under ORACLE.SECURITY.KM.ENCRYPTION it will not be exported when doing administer key management export keys. As a result, when you import the export file the master key required to decrypt the data will not be there, potentially leaving you in a peculiar situation, especially if you have used an in-place conversion. Even if you specifically export the key using the with identifier in syntax, you will not be able to import it because specific key exports are not permitted to be imported into PDBs.
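As a side note, the master keys known to the database can also be listed from SQL, which helps when verifying what actually made it across an export/import. A minimal sketch (v$encryption_keys is available starting with 12.1):
select key_id, creation_time from v$encryption_keys;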

Add ORACLE.SECURITY.KM.ENCRYPTION to db1's wallet

This step is only required if you created the wallet without using the new administer key management syntax. Re-keying the wallet will generate a new master key while preserving the old master key necessary to decrypt the data, and will add the missing ORACLE.SECURITY.KM.ENCRYPTION entries at the same time. Execute the following while connected to db1:
SQL> administer key management
        set key identified by "db1"
        with backup;  2    3

keystore altered.
If we look at the wallet we can see that it now contains the necessary entries:
[oracle@ora12cr1 ~]$ orapki wallet display -wallet /u01/app/oracle/admin/db1/wallet/tde
Oracle PKI Tool : Version 12.1.0.2
Copyright (c) 2004, 2014, Oracle and/or its affiliates. All rights reserved.

Enter wallet password:
Requested Certificates:
Subject:        CN=oracle
User Certificates:
Oracle Secret Store entries:
ORACLE.SECURITY.DB.ENCRYPTION.AXZlzWtP/U/xv+vihPzeaGkAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
ORACLE.SECURITY.DB.ENCRYPTION.AY/sdMkWZk/4v3YgBKzZtcIAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
ORACLE.SECURITY.DB.ENCRYPTION.MASTERKEY
ORACLE.SECURITY.ID.ENCRYPTION.
ORACLE.SECURITY.KB.ENCRYPTION.
ORACLE.SECURITY.KM.ENCRYPTION.AXZlzWtP/U/xv+vihPzeaGkAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
ORACLE.SECURITY.KM.ENCRYPTION.AY/sdMkWZk/4v3YgBKzZtcIAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
ORACLE.SECURITY.KM.ENCRYPTION.MASTERKEY
Trusted Certificates:
Prepare db1 to be converted into a PDB

Before db1 can be plugged into a container database it needs to be converted to a PDB and its encryption keys exported. Shut down db1 and open it read only:
SQL> show parameter db_name

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
db_name                              string      db1
SQL> shutdown immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup mount restrict
ORACLE instance started.

Total System Global Area 1241513984 bytes
Fixed Size                  2923872 bytes
Variable Size             452985504 bytes
Database Buffers          771751936 bytes
Redo Buffers               13852672 bytes
Database mounted.
SQL> alter database open read only;

Database altered.
We can now export the encryption keys:
SQL> administer key management
        set keystore open
        identified by "db1";  2    3

keystore altered.

SQL> administer key management
        export keys with secret "db1"
        to '/u01/app/oracle/admin/cdb12cr1/wallet/tde/db1.exp'
        identified by "db1";  2    3    4

keystore altered.
Note that in order for the export operation to work the wallet needs to be explicitly opened with a password, otherwise you will receive an error. In the case of an auto-login (local) wallet you will have to close the wallet and reopen it with a password.
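A minimal sketch of what that looks like when an auto-login wallet is currently open:
administer key management set keystore close;
administer key management set keystore open identified by "db1";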

The next step is to generate the metadata necessary for the PDB conversion and shut down the db1 database:
SQL> exec dbms_pdb.describe(pdb_descr_file => '/u01/app/oracle/db1.xml');

PL/SQL procedure successfully completed.

SQL> shutdown immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
Convert and plug db1 into a CDB

We can now log into cdb12cr1 and perform an in-place conversion of db1, which does not require any datafiles to be copied. Of course, if something were to go wrong with the conversion process you might end up having to restore your original database from a backup, so use this approach with care.
SQL> show parameter db_name

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
db_name                              string      cdb12cr1
SQL> create pluggable database pdb1
        using '/u01/app/oracle/db1.xml'
        nocopy tempfile reuse;  2    3

Pluggable database created.
Before pdb1 can be opened we need to run the script which will convert pdb1's data dictionary:
SQL> alter session set container=pdb1;

Session altered.

SQL> @?/rdbms/admin/noncdb_to_pdb.sql
Once the script completes we can open pdb1:
SQL> alter pluggable database pdb1 open;

Warning: PDB altered with errors.
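The reason for the warning can be checked in the pdb_plug_in_violations view; a minimal sketch (the column list is illustrative):
select name, type, status, message
  from pdb_plug_in_violations
 where name = 'PDB1';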
The violation recorded in pdb_plug_in_violations tells us that the encryption key is missing. Let's go and import the key now:
[oracle@ora12cr1 ~]$ sqlplus / as sysdba

SQL*Plus: Release 12.1.0.2.0 Production on Sat Mar 7 18:22:23 2015

Copyright (c) 1982, 2014, Oracle.  All rights reserved.


Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> alter session set container=pdb1;

Session altered.

SQL> administer key management
        set keystore open
        identified by "cdb12cr1";  2    3

keystore altered.

SQL> administer key management
        import keys with secret "db1"
        from '/u01/app/oracle/admin/cdb12cr1/wallet/tde/db1.exp'
        identified by "cdb12cr1"
        with backup;  2    3    4    5

keystore altered.

SQL> alter pluggable database pdb1 close;

Pluggable database altered.

SQL> alter pluggable database pdb1 open;

Pluggable database altered.
The encryption keys are now imported and we no longer get an error. Note that, as with the export, the wallet must be explicitly opened with a password in order to import the keys. We can verify that everything is in order by querying the encrypted table:
SQL> administer key management
        set keystore open
        identified by "cdb12cr1";  2    3

keystore altered.

SQL> select count(*) from encrypt_user.z_encrypted;

  COUNT(*)
----------
       100

Monday, September 08, 2014

Zone Maps On Commit Refresh Oddities

One of the ways Zone Maps can be refreshed when the underlying table data is changed is fast on commit. This is similar to how materialized views can be refreshed with the exception that a Zone Map does not need a materialized view log to do so.

It can also lead to some peculiar side effects.

Test setup

Let's begin by creating a test table with the on commit refresh materialized zone map:
SQL> create table t pctfree 95 clustering by linear order (n) as
  2   select level n, rpad('x', 200, 'x') v
  3    from dual
  4    connect by level <= 5000;
 
Table created
 
SQL> create materialized zonemap zm$t refresh fast on commit on t (n);
 
Done

SQL> select * from zm$t order by 1;
 
  ZONE_ID$    MIN_1_N    MAX_1_N ZONE_LEVEL$ ZONE_STATE$ ZONE_ROWS$
---------- ---------- ---------- ----------- ----------- ----------
3241022750          1        748           0           0        748
3241022750        749       1756           0           0       1008
3241022750       1757       2764           0           0       1008
3241022750       2765       3772           0           0       1008
3241022750       3773       4780           0           0       1008
3241022750       4781       5000           0           0        220
 
6 rows selected
Zone Staleness (first session)

When a session updates a row in the table it will also mark the corresponding zone in the zone map as stale by setting ZONE_STATE$=1:
SQL> update t set n=0 where n=1;
 
1 row updated
 
SQL> select * from zm$t order by 1;
 
  ZONE_ID$    MIN_1_N    MAX_1_N ZONE_LEVEL$ ZONE_STATE$ ZONE_ROWS$
---------- ---------- ---------- ----------- ----------- ----------
3241022750          1        748           0           1        748 <=== Zone marked as stale
3241022750        749       1756           0           0       1008
3241022750       1757       2764           0           0       1008
3241022750       2765       3772           0           0       1008
3241022750       3773       4780           0           0       1008
3241022750       4781       5000           0           0        220
 
6 rows selected
A stale zone is always scanned when performing Zone Map pruning, regardless of the predicates involved. This makes sense: otherwise we would not be able to select the updated row back while still using Zone Map pruning, because the new value is now outside the zone's recorded range. With the zone marked as stale we have no such problem:
SQL> select n from t where n=0;
 
         N
----------
         0
A zone is marked as fresh again (ZONE_STATE$=0) when the session commits. Note that our session has not committed yet.

Second session

So what happens if some other session updates a row belonging to the stale zone and then commits? Does the zone get invalidated by being marked as fresh again? Let's find out:
SQL> update t set n=2 where n=2;
 
1 row updated
 
SQL> commit;
 
Commit complete
 
SQL> select * from zm$t order by 1;
 
  ZONE_ID$    MIN_1_N    MAX_1_N ZONE_LEVEL$ ZONE_STATE$ ZONE_ROWS$
---------- ---------- ---------- ----------- ----------- ----------
3241022750          1        748           0           0        748 <=== Zone is no longer marked as stale
3241022750        749       1756           0           0       1008
3241022750       1757       2764           0           0       1008
3241022750       2765       3772           0           0       1008
3241022750       3773       4780           0           0       1008
3241022750       4781       5000           0           0        220
 
6 rows selected
Indeed the zone is no longer marked as stale! Now let's go back to our first session...

First session

So what happens if we try to select the same updated row in the first session?
SQL> select n from t where n=0;

no rows selected
We've lost the ability to see the updated row thanks to the second session invalidating the zone! Worse yet, we still won't be able to see the row even after this session commits, meaning any query in the database which relies on this Zone Map for pruning will be unable to see the data either:
SQL> commit;

Commit complete.

SQL> select n from t where n=0;

no rows selected

SQL> select /*+ no_zonemap(t scan) */ n from t where n=0;

         N
----------
         0
The only way to select the row is to disable Zone Map pruning.

The end result is that we have a Zone Map which is perfectly "valid" as far as Oracle is concerned, yet out of sync with the actual table data:
SQL> select zonemap_name, invalid, stale, unusable, compile_state from dba_zonemaps;
 
ZONEMAP_NAME INVALID STALE   UNUSABLE COMPILE_STATE
------------ ------- ------- -------- -------------------
ZM$T         NO      NO      NO       VALID 

The fact that the Oracle documentation claims that on commit refresh zone maps stay transactionally fresh suggests that this behavior is clearly a bug resulting from the incorrect handling of concurrent transactions modifying data belonging to the same zone.
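If you run into this situation, rebuilding the zone map should bring it back in sync with the table data; a sketch (this repairs the affected zone map but obviously does not address the underlying bug):
alter materialized zonemap zm$t rebuild;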