Many of the Linux analysts at Gartner, including myself, have received customer inquiries regarding Oracle’s lack of certification and support of Oracle Database 11g R2 on Red Hat Enterprise Linux (RHEL) 6 and on Oracle’s own Red Hat Compatible Kernel in Oracle Linux (OL) 6. Oracle does certify and support Oracle Database 11g R2 on its Unbreakable Enterprise Kernel (UEK, including the recently released R2), which is a more recent hardening of the Linux mainline kernel with a focus on performance and scalability for Oracle database, middleware, and applications.
Fortunately, Oracle recently announced that Database 11g R2 and Oracle Fusion Middleware 11gR1 will be supported on RHEL 6 and OL 6 running the Red Hat Compatible Kernel within 90 days of its March 22, 2012 announcement. So this brings me to the point of my blog. Conspiracy theorists love to think of all sorts of reasons for Oracle’s non-support of RHEL 6 up to this point, from Oracle wanting to destroy Red Hat (which I can’t see would help them or anyone for that matter) to who-knows-what, but they tend to gloss over the non-glamorous possibilities.
Many of you know that I worked for a software vendor for over 20 years; specifically, a systems software vendor. At one point early in my career, I started an engineering “special ops” team. Think of these engineers as “Navy SEALs” specifically assigned to track down, pinpoint, and fix software issues reported by customers. We called this team the “Critical Problem Resolution” team, or CPR for short (yes, “CPR” was chosen for its association with saving lives in the medical field). In that role, I learned a whole lot about debugging problems, especially interactions between multiple vendors’ software when integrated into a complete system. I also learned of the vast difference between systems debugging and application debugging: they are two different worlds.
Bottom line: the time it takes to debug and fix a problem is inversely proportional to the ease with which the problem can be reproduced. Intermittent problems were my worst nightmare. We would only get a whack at the problem very infrequently, meaning that it would take us a whole lot of time to pinpoint the issue in order to solve it. We loved easy-to-reproduce problems, as we could fix or work around them quickly. However, occasionally an issue would arise that we would discover was rooted in an architectural mistake. Those were difficult to fix, as fixing the issue for one situation would often break other configurations.
Oftentimes systems engineers resolve these by adding configuration switches that alter the software’s behavior based on the configuration it is run in.
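To make the idea of a configuration switch concrete, here is a minimal sketch in Python. It borrows the init system as its example, since RHEL 6 moved from System V init to Upstart. The `RuntimeConfig` class, the `init_system` switch, and the service name `exampledb` are all hypothetical names made up for illustration; this is not how Oracle or any particular vendor does it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuntimeConfig:
    # Hypothetical switch: which init system the host uses ("sysv" or "upstart").
    init_system: str = "sysv"


def service_start_command(cfg: RuntimeConfig, service: str) -> list:
    """Return the command used to start a service under the configured init system."""
    if cfg.init_system == "upstart":
        # Upstart jobs are started through initctl.
        return ["initctl", "start", service]
    if cfg.init_system == "sysv":
        # System V init scripts conventionally live under /etc/init.d.
        return ["/etc/init.d/" + service, "start"]
    raise ValueError("unknown init system: " + cfg.init_system)


if __name__ == "__main__":
    # "exampledb" is a made-up service name.
    print(service_start_command(RuntimeConfig("upstart"), "exampledb"))
    print(service_start_command(RuntimeConfig("sysv"), "exampledb"))
```

The catch, of course, is that every switch like this doubles the number of code paths somebody has to test.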
The Oracle Database Quality Assurance team has a battery of test suites that they run through in order to qualify the database on each platform. Now, an RDBMS is a unique piece of code (actually lots of modules) in that it is both a system and an application in the nature of its operations. In addition, RDBMSs have the ability to stress an operating system and underlying hardware more than other applications and systems do. Stress testing runs to hundreds of threads that are all contending for shared resources, allocating and freeing memory in many different sizes and ways, spin-locking across multiple cores, and hitting the I/O channels harder than almost anything else.
Now add configuration testing on top of this: Oracle’s support of Oracle Database 11g R2 on RHEL 6 also includes support of the Oracle Real Application Clusters (RAC) option. RAC is yet another animal and depends heavily on tight OS timing, hence its sensitivity to timing issues in the underlying hardware and operating system. Remember that RAC must synchronize RDBMS state between all of the nodes in the cluster: every RAC node changes state in response to SQL queries, and these changes must be coordinated and synchronized across the entire cluster. In addition, node failure, removal, or addition must also be properly brought in and out of this synchronized system (managed by a distributed lock manager, or DLM). Cluster DLMs are the fiercest of all systems on an operating system. The amount of testing is truly intense, and a DLM will find all sorts of hidden timing issues in today’s complex SMP operating systems and hardware. RDBMS testing is a unique beast and, as expected, can stress a system and find subtle errors and timing windows that no other tests are able to find. As a result, all issues that Oracle finds, regardless of their sources, have to be resolved prior to supporting the RDBMS on any given platform, and this could take time.
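For a feel of the kind of contention described above, here is a toy sketch in Python: many threads allocating and freeing memory in different sizes while fighting over one shared resource. The function name and parameters are my own invention, and this is nowhere near what a real RDBMS test suite does; it only illustrates the shape of such a test, which is to hammer shared state and then check that an invariant still holds.

```python
import random
import threading


def run_contention_stress(num_threads=8, iterations=5000, seed=0):
    """Hammer one shared counter from many threads and return its final value."""
    lock = threading.Lock()
    state = {"counter": 0}

    def worker(worker_id):
        # Each thread gets its own RNG so the threads share no random state.
        rng = random.Random(seed + worker_id)
        for _ in range(iterations):
            # Allocate and drop a buffer of varying size to churn memory.
            scratch = bytearray(rng.choice((16, 256, 4096)))
            scratch[0] = 1
            # The read-modify-write is only safe while holding the lock.
            with lock:
                state["counter"] += 1

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["counter"]


if __name__ == "__main__":
    # Invariant: every increment from every thread must be accounted for.
    assert run_contention_stress(8, 5000) == 8 * 5000
```

If the lock were removed, lost updates could appear only once in a while, which is exactly the sort of intermittent, timing-dependent failure that is so painful to reproduce.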
Furthermore, fixing any issues found can be done in a number of ways: a repair in the operating system code (Linux), a repair in the RDBMS code (such as a change to use Upstart instead of init, as Red Hat moved from the System V init method to the newer Upstart in RHEL 6), or a work-around in either place. Unfortunately, a fix of any scope (especially a work-around in the RDBMS code) typically requires a full run-through of all the test cases. With something as complex as an RDBMS, this can take a long time.
In summary, all the indications are that Oracle has been working through many issues with Oracle Database 11g R2 on the RHEL 6 and OL 6 kernels and was not prepared to certify and support it on those platforms until it was confident those issues had been resolved. I learned back in my “CPR” days that promising a fix in a given amount of time would get me in trouble if I did so before the engineering team had pinpointed the actual issue. But management would always push me hard for a date and time for a fix. Can you say “being between a rock and a hard place”?