
IT-DB

LCG

Oracle Services at CERN

http://cern.ch/jamie/ATLAS-March04.ppt
Jamie Shiers, IT-DB

http://cern.ch/db/
Overview
Oracle Service Status and Outlook

Oracle Contract Status

Oracle and the CERN openlab

Oracle Technology Update


Oracle 10g: Database, Application Server and Grid Control

A few comments on ATLAS presentation to PEB (Oracle issues only)

How Oracle Might Fit into an ATLAS Distributed Database


Deployment Strategy (also from ATLAS to PEB)

Summary
Oracle at CERN
Originally chosen in LEP construction phase (1983)

Services have expanded across the laboratory:
Now used in all areas of the lab's work
Accelerator construction, operation, deconstruction (LEP), …
AIS applications (CET, EDH, HRT, …)
Detector construction, calibration, …
Physics related services: LCG RLS, COMPASS / HARP event-level meta-data
And from Oracle Database to DB + Application Server + Grid Control
e.g. RLS, AIS apps

Total number of services ~100
Sun Cluster, single-instance Suns, disk servers running RHEL, …
Total non-physics data: a few hundred GB (400 GB alone for AIS)
Total physics data (COMPASS): a few TB
Oracle Services for Physics
2-node Sun Cluster with ~350 GB of FC-attached disk
Storage space allocated (so far...) via COCOTIME
Runs Oracle Real Application Cluster(s) (RAC)
High-availability solution giving transparent application failover in case of h/w problems (see the connect-descriptor sketch after this list)
Complemented by various disk servers running RHEL
For applications requiring 100 GB to 1 TB of storage

Application Servers? ALICE only so far, except:
LCG: RLS / RMC
Application Server (farm node) per VO
Shared DB backend
Similar configs for test and certification

First deployments at outside sites (CNAF, FZK, RAL, …)
This includes (production?) tests of WAN async. replication
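As an aside, a minimal sketch of the transparent failover idea above, assuming a Python client using cx_Oracle; the hosts, service name and credentials are invented, and the real descriptor would come from the service's Oracle Net configuration:

```python
# Hypothetical sketch: connecting through an Oracle Net descriptor with
# Transparent Application Failover (TAF), so a client moves to the surviving
# RAC node if one node fails. Hosts, service and credentials are placeholders.
import cx_Oracle

dsn = """(DESCRIPTION=
  (ADDRESS_LIST=
    (ADDRESS=(PROTOCOL=TCP)(HOST=rac-node1.cern.ch)(PORT=1521))
    (ADDRESS=(PROTOCOL=TCP)(HOST=rac-node2.cern.ch)(PORT=1521))
    (FAILOVER=ON))
  (CONNECT_DATA=
    (SERVICE_NAME=physdb)
    (FAILOVER_MODE=(TYPE=SELECT)(METHOD=BASIC))))"""

# TYPE=SELECT lets in-flight queries resume on the other node after failover.
conn = cx_Oracle.connect("reader", "secret", dsn)
```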
Physics Services - Outlook
Planning for more applications, more data, more users

Non-event data: assuming a few hundred GB per experiment

Event data: Collections? Event-level meta-data?

Have to plan for COMPASS level: ~1% of data volume means DBs in the 10 TB / 100 TB (with time) range

Probably not a problem, but will require use of new features in 10g and beyond (for ULDB, storage management etc.)

Build on common strategies (later) and establish SLAs?

Continue with RAC + disk servers?



Oracle Services for LCG


Goals
To offer production-quality services for LCG to meet the requirements of current (and future) data challenges
e.g. CMS PCP/DC04, ALICE PDC-3, ATLAS DC2, LHCb CDC04

To provide distribution kits, scripts and documentation to assist other sites in offering production services

To leverage the many years' experience in running such services at CERN and other institutes
Monitoring, backup & recovery, tuning, capacity planning, …

To understand experiments' requirements in how these services should be established and extended, and to clarify current limitations

Not targeting small/medium-scale DB apps that need to be run and administered locally (to the user)
What Services?
POOL file catalogue using EDG-RLS (also non-POOL!)
LRC + RLI services + client APIs
For GUID <-> PFN mappings (an illustrative lookup follows this list)
and EDG-RMC
For file-level meta-data: POOL currently stores filetype (e.g. ROOT file), fully registered, job status
Expect also ~10 items from CMS DC04: others?
plus (service behind) EDG Replica Manager client tools

Need to provide robustness, recovery, scalability, performance, …

File catalogue is a critical component of the Grid!
Job scheduling, data access, …
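To make the GUID <-> PFN mapping concrete, a purely illustrative sketch in Python; the table and column names (lrc_mapping, guid, pfn) and the credentials are invented for illustration and are not the actual EDG-RLS schema:

```python
# Illustrative only: the kind of lookup an LRC resolves.
# Table/column names and credentials are invented, not the EDG-RLS schema.
import cx_Oracle

conn = cx_Oracle.connect("rls_reader", "secret", "rlsdb")
cur = conn.cursor()
cur.execute("SELECT pfn FROM lrc_mapping WHERE guid = :guid",
            guid="00000000-0000-0000-0000-000000000000")  # placeholder GUID
for (pfn,) in cur:
    print(pfn)  # one or more physical file names for this GUID
```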
What if? (What happens during the period when the catalog is unavailable?)

DB server dies
No access to catalog until new server configured & DB restored
Hot standby or clustered solution offers protection against most common cases
Regular dump of full catalog into alternate format, e.g. POOL XML? (sketched after this list)

Application server dies
Stateless, hence relatively simple move to a new host
Could share with another VO
Handled automatically with application server clusters

Data corrupted
Restore or switch to alternate catalog

Software problems
Hardest to predict and protect against
Could cause running jobs to fail and drain batch queues!
Very careful testing, including by experiments, before move to a new version of the middleware (weeks, including smallish production run?)

Need to foresee all possible problems, establish recovery plan and test!
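The "regular dump into an alternate format" idea might look like the sketch below; the element and attribute names only approximate a POOL XML catalogue and should be checked against the real POOL schema:

```python
# Sketch of dumping GUID -> PFN pairs to an XML fallback catalogue.
# Element/attribute names approximate a POOL XML catalogue; verify them
# against the real POOL schema before relying on this format.
import xml.etree.ElementTree as ET

def dump_catalogue(rows, path):
    """rows: iterable of (guid, pfn) pairs read from the relational catalogue."""
    root = ET.Element("POOLFILECATALOG")
    for guid, pfn in rows:
        entry = ET.SubElement(root, "File", ID=guid)
        ET.SubElement(entry, "pfn", name=pfn)
    ET.ElementTree(root).write(path)

dump_catalogue([("guid-example", "castor:/castor/cern.ch/grid/file1")],
               "catalogue_fallback.xml")
```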
RLS - Handling Interventions
iAS: can transparently switch to a new box using a DNS alias change (illustrated after this list)
Used for both scheduled and unscheduled interruptions
Web-based tool uses SOAP interface to IT-CS supported solution (not NETOPS)

DB: short interruption to move to stand-by DB
Does not currently use Oracle Data Guard (aka stand-by DB)

Standard Oracle solutions:
iAS clusters
DB clusters (RAC, requires SAN storage)
DataGuard
(Replication)

Evaluating these using openlab fellows for eventual deployment H2 2004(?)
My guess: medium term will use iAS clusters & DataGuard; RAC?
And maybe also replication as temporary inter-site solution?
Longer-term solution for T0 + T1s?
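A toy illustration of why the DNS alias change is transparent to clients (the alias name is invented):

```python
# Clients only ever use the stable alias, so repointing the alias in DNS
# moves the service without any client-side reconfiguration.
import socket

SERVICE_ALIAS = "rls-atlas.cern.ch"  # hypothetical alias maintained by IT-CS

# Each fresh connection resolves the alias again; after the alias is
# repointed, new connections land on the replacement application server.
print(socket.gethostbyname(SERVICE_ALIAS))
```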
Oracle Contract
Previously based on active users, audited via CERN tools
Highly non-standard; obsolete list of products and machines
Maintenance costs were growing at 15% compound

New contract based on named users (standard Oracle license)
Platform independent
Location independent
iAS licenses dramatically increased
Maintenance costs reduced and fixed for 5 years
Extended to all CERN staff + users (HR numbers)
s/w can be installed and run at collaborating institutes for CERN work

Double-edged sword: support issues for outside use are a big concern
Oracle Distribution
Users (sites) must register (group OR) and sign contract addendum
Oracle DB and iAS packaged and redistributed as the basis of the file (metadata) catalog for LCG
Well-defined application, well shielded from users
Defined version of Oracle components, single supported platform (RHEL)
Tools / kits now used within the IT-DB group for CERN services
Also at a few Tier1 sites (CNAF, FZK, RAL, …)

First non-LCG customer: COMPASS
Offload significant(?) fraction of their production / analysis
Requires local Oracle expertise
Bulk data distribution: a few times per year via transportable tablespaces (sketched after this list)

Requests from other groups in queue; proceeding at a rate we can support without impacting CERN production services
Not targeting general-purpose DB services outside (Yet? Ever?)
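For orientation, a hedged sketch of the transportable-tablespace flow mentioned above; the tablespace name, credentials and datafile paths are placeholders, and the exp/imp option spellings should be verified against the installed Oracle release:

```python
# Rough sketch of bulk distribution via transportable tablespaces.
# All names are placeholders; check exp/imp options for your Oracle release.
import subprocess

# 1. At the source, make the tablespace read-only (e.g. in SQL*Plus):
#      ALTER TABLESPACE compass_data READ ONLY;
# 2. Export just its metadata:
subprocess.run(["exp", "userid=system/secret",
                "transport_tablespace=y", "tablespaces=compass_data"],
               check=True)
# 3. Copy the datafiles and the export dump to the receiving site.
# 4. At the destination, plug the tablespace in:
subprocess.run(["imp", "userid=system/secret", "transport_tablespace=y",
                "datafiles=/oradata/compass_data01.dbf"],
               check=True)
```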
Other Oracle Distributions
Client run-time
In principle solved by Oracle 10g instant client (available)
To be tested. Satisfies ATLAS requirements? (Luc Goossens)

Client developer
Could be an additional RPM on top of the above; copy of CERN AFS tree?
To be prepared and tested

Neither of the above work-items is currently scheduled
RLS kit not really suitable just for client usage

Big concern is additional support load
Need to reproduce problems at CERN? Beyond that? (A client-side sketch follows.)
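A minimal sketch of the client run-time case, assuming the Instant Client shared libraries are already on the process's library path (paths and connect string invented):

```python
# Sketch: with the Instant Client libraries on LD_LIBRARY_PATH (set before
# the process starts), an Oracle client such as cx_Oracle can connect with
# no full Oracle installation. Connect string and credentials are placeholders.
import cx_Oracle

# Easy Connect string: host/service, both invented.
conn = cx_Oracle.connect("user", "password", "dbhost.cern.ch/service1")
print(conn.version)  # server version, proving the round trip works
```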
Oracle and the openlab
Oracle is now a sponsor of the CERN openlab

Funding 2 CERN fellows


One focusing on core technology, the other on LCG / physics aspects
Also evaluating features of potential benefit to AIS (and hence the lab)

Direct links to development teams in Europe and US

Program of work driven by service needs


Initially for RLS and CASTOR services
Investigations into Data Guard, Replication and RAC, …
Later 10g and other investigations

Other contributions include OCP training


Increases value of OCP
Helps attract good (short-term) people
Oracle 10g
Latest release of Oracle core technology
Includes Database, Application Server (9.0.4) and Enterprise Manager (now renamed Grid Control)

The "g" stands for Grid
Enterprise Grid, rather than Scientific Grid

What does this really mean?
Emphasis on clusters / farms of blades running Linux
Dynamic reconfiguration of nodes in/out of DB cluster (RAC) / AS cluster

Lots of nice new features (next)
Some requested by CERN
Oracle Database 10g
The self-tuning database (Richard Sarwal embedded)

Support for regular expressions (see sketch after this slide)
Native IEEE floats and doubles
ULDB enhancements: big file tablespaces
Automatic Storage Management
RMAN enhancements, e.g. "wastebasket" (90% of restore requests?)
Instant client install (RPM containing just a couple of shared libraries)
Cross-platform transportable tablespaces
Data Pump

Participated in beta programme (very little effort available), but all relevant features need to be further evaluated
No production plans yet!
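Two of the features listed above can be illustrated from any client; REGEXP_LIKE and BINARY_DOUBLE are standard Oracle 10g SQL, while the table and column names and credentials here are invented:

```python
# Illustration of two Oracle 10g features named above; everything except
# the 10g SQL keywords (REGEXP_LIKE, BINARY_DOUBLE) is scaffolding.
import cx_Oracle

conn = cx_Oracle.connect("user", "secret", "testdb")
cur = conn.cursor()

# Server-side regular expressions (new in 10g):
cur.execute(r"SELECT pfn FROM files WHERE REGEXP_LIKE(pfn, '\.root$')")

# Native IEEE floating point (new in 10g): BINARY_FLOAT / BINARY_DOUBLE.
cur.execute("CREATE TABLE calib (channel NUMBER, gain BINARY_DOUBLE)")
```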
Oracle Application Server 10g
In reality, just iAS 9.0.4 renamed
Continues the trend of becoming an ever more powerful / essential part of the application deployment infrastructure:
(Web) user interfaces, (XML) web services, Oracle Forms

Numerous iAS concerns in recent months; workarounds have been found
Many con-calls and high-level discussions
Thomas Kurian et al.

On paper, AS 10g solves most critical issues and offers awaited functionality for easier / better deployment: clustering, cloning

Oracle Forms: with the web deployment changes (from GUI to application server), a lot of Forms will need to be reviewed and moved to a Forms Application Server

Still an area where more discussion / feedback with Oracle is needed

Oracle Enterprise Manager 10g
Enterprise Manager 9i is used to manage / monitor some of our services

EM10g, aka Grid Control, is a major rewrite of the Oracle management tool (web based, metric oriented, open repository)

Paper evaluation is very promising (possibility to evaluate usage trends, implement SLAs, access by users)
Early access to the software allowed us to provide feedback and start preparing deployment

Should be a key component of our tool-set in managing services
Gradually replacing home-built tools with OEM


Service Outlook
Current services run well, but are costly!
Streamlining / rationalisation will bring many benefits
Use of common strategies, architectures, building blocks
Higher level of service and / or more with same staffing
e.g. many outstanding requests for iAS services
(including for our own infrastructure services, e.g. Forms!)

Too many services today rely on:
Good will
Good luck
This is no basis on which to build the future

Only way to survive with an increasing number of DB and iAS servers, increasing volume of data and increasing complexity of applications is to simplify and standardize
Group-wide Strategy (Vision)

Standardization
Architectures, configurations, processes
Simplifies management, maintenance and trouble-shooting
Reduce Complexity
Avoid diversification
Improve Security / Safety
Protect against unauthorized access, partial data loss, etc.
Test regularly!
Tools
Acquire and / or create tools to stay in control. Be pro-active
Quality Control
Define deliverables together with customer and measure them!
Service Level Agreements (realistic, measurable) for all services
(Re-)implementing Services
Cannot take an existing non-HA service, add an SLA and make it HA

Several techniques available; need to be further studied
openlab et al.

Need to discuss options and impact early on with users
e.g. DataGuard, Replication, RAC all offer solutions in the area of HA
Most appropriate depends on the needs of the service
How much downtime can be tolerated?
Pros & cons of multiple cheap boxes versus more expensive (and more complex) hardware

Learnt a lot from implementing ~24 x 7 services for LCG
Extend to other services as part of group vision (standards!)

Possible H/A DB Solutions
RAC (Real Application Clusters = DB clusters)
Requires special H/W
Automatic application failover (read & write)
More complex to manage than a single box
Notes: best paper solution? (But most expensive.) We are using them! (On Sun with SAN)

DataGuard (= standby DB)
Logical or physical copy of the complete DB
Good for reports on standby
Logical standby has some limitations on schema; physical standby is closed
Notes: widely used in conjunction with RAC

Replication
Finer level of granularity; (a)synchronous
Multi-master (and more complex setups) possible
Notes: performance / management overhead? Stop-gap for distributed RLS? (CMS)

Comments on ATLAS presentation to PEB (Oracle issues only)


ATLAS presentation to PEB

gcc versions

Administrative tools

POOL collections and Oracle



How Oracle Might Fit into an ATLAS Distributed Database


Deployment Strategy
ATLAS slides to PEB
ATLAS would like to see a distributed Oracle deployment model, with the infrastructure and tools and people and commitment to support it
Persistence RTAG report, endorsed by SC2, recommended specific steps in this direction. What steps have been taken?

A SUPPORTED heterogeneous deployment infrastructure would be particularly interesting to ATLAS
Oracle at CERN (or CERN plus selected Tier 1s), MySQL elsewhere?
