
IT-DB

LCG

Oracle Services at CERN

http://cern.ch/jamie/ATLAS-March04.ppt
Jamie Shiers, IT-DB

http://cern.ch/db/
Overview
Oracle Service Status and Outlook

Oracle Contract Status

Oracle and the CERN openlab

Oracle Technology Update


Oracle 10g: Database, Application Server and Grid Control

A few comments on ATLAS presentation to PEB (Oracle issues only)

How Oracle Might Fit into an ATLAS Distributed Database


Deployment Strategy (also from ATLAS to PEB)

Summary
Oracle at CERN
Originally chosen in LEP construction phase (1983)

Services have expanded across the laboratory:
Now used in all areas of the lab's work
Accelerator construction, operation, deconstruction (LEP), …
AIS applications (CET, EDH, HRT, …)
Detector construction, calibration, …
Physics related services: LCG RLS, COMPASS / HARP event-level meta-data
And from Oracle Database to DB + Application Server + Grid Control
e.g. RLS, AIS apps

Total number of services ~100
Sun Cluster, single-instance Suns, disk servers running RHEL, …
Total non-physics data: a few hundred GB (400 GB alone for AIS)
Total physics data (COMPASS): a few TB
Oracle Services for Physics
2-node Sun Cluster with ~350 GB of FC-attached disk
Storage space allocated (so far...) via COCOTIME
Runs Oracle Real Application Cluster(s) (RAC)
High-availability solution giving transparent application failover in case of h/w problems (see the connect-descriptor sketch after this list)
Complemented by various disk servers running RHEL
For applications requiring 100 GB to 1 TB of storage

Application Servers? ALICE only so far, except:
LCG: RLS / RMC
Application Server (farm node) per VO
Shared DB backend
Similar configs for test and certification

First deployments at outside sites (CNAF, FZK, RAL, …)
This includes (production?) tests of WAN async. replication
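As an aside, a minimal sketch of the transparent failover idea above, assuming a Python client using cx_Oracle; the hosts, service name and credentials are invented, and the real descriptor would come from the service's Oracle Net configuration:

```python
# Hypothetical sketch: connecting through an Oracle Net descriptor with
# Transparent Application Failover (TAF), so a client moves to the surviving
# RAC node if one node fails. Hosts, service and credentials are placeholders.
import cx_Oracle

dsn = """(DESCRIPTION=
  (ADDRESS_LIST=
    (ADDRESS=(PROTOCOL=TCP)(HOST=rac-node1.cern.ch)(PORT=1521))
    (ADDRESS=(PROTOCOL=TCP)(HOST=rac-node2.cern.ch)(PORT=1521))
    (FAILOVER=ON))
  (CONNECT_DATA=
    (SERVICE_NAME=physdb)
    (FAILOVER_MODE=(TYPE=SELECT)(METHOD=BASIC))))"""

# TYPE=SELECT lets in-flight queries resume on the other node after failover.
conn = cx_Oracle.connect("reader", "secret", dsn)
```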
Physics Services - Outlook
Planning for more applications, more data, more users

Non-event data: assuming a few hundred GB per experiment

Event data: Collections? Event-level meta-data?

Have to plan for COMPASS level: ~1% of data volume means DBs in the 10 TB / 100 TB (with time) range

Probably not a problem, but will require use of new features in 10g and beyond (for ULDB, storage management etc.)

Build on common strategies (later) and establish SLAs?

Continue with RAC + disk servers?



Oracle Services for LCG


Goals
To offer production-quality services for LCG to meet the requirements of current (and future) data challenges
e.g. CMS PCP/DC04, ALICE PDC-3, ATLAS DC2, LHCb CDC04

To provide distribution kits, scripts and documentation to assist other sites in offering production services

To leverage the many years' experience in running such services at CERN and other institutes
Monitoring, backup & recovery, tuning, capacity planning, …

To understand experiments' requirements in how these services should be established and extended, and to clarify current limitations

Not targeting small/medium-scale DB apps that need to be run and administered locally (to the user)
What Services?
POOL file catalogue using EDG-RLS (also non-POOL!)
LRC + RLI services + client APIs
For GUID <-> PFN mappings (an illustrative lookup follows this list)
and EDG-RMC
For file-level meta-data: POOL currently stores filetype (e.g. ROOT file), fully registered, job status
Expect also ~10 items from CMS DC04: others?
plus (service behind) EDG Replica Manager client tools

Need to provide robustness, recovery, scalability, performance, …

File catalogue is a critical component of the Grid!
Job scheduling, data access, …
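To make the GUID <-> PFN mapping concrete, a purely illustrative sketch in Python; the table and column names (lrc_mapping, guid, pfn) and the credentials are invented for illustration and are not the actual EDG-RLS schema:

```python
# Illustrative only: the kind of lookup an LRC resolves.
# Table/column names and credentials are invented, not the EDG-RLS schema.
import cx_Oracle

conn = cx_Oracle.connect("rls_reader", "secret", "rlsdb")
cur = conn.cursor()
cur.execute("SELECT pfn FROM lrc_mapping WHERE guid = :guid",
            guid="00000000-0000-0000-0000-000000000000")  # placeholder GUID
for (pfn,) in cur:
    print(pfn)  # one or more physical file names for this GUID
```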
What if? (What happens during the period when the catalog is unavailable?)

DB server dies
No access to catalog until new server configured & DB restored
Hot standby or clustered solution offers protection against most common cases
Regular dump of full catalog into alternate format, e.g. POOL XML? (sketched after this list)

Application server dies
Stateless, hence relatively simple move to a new host
Could share with another VO
Handled automatically with application server clusters

Data corrupted
Restore or switch to alternate catalog

Software problems
Hardest to predict and protect against
Could cause running jobs to fail and drain batch queues!
Very careful testing, including by experiments, before move to a new version of the middleware (weeks, including smallish production run?)

Need to foresee all possible problems, establish recovery plan and test!
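The "regular dump into an alternate format" idea might look like the sketch below; the element and attribute names only approximate a POOL XML catalogue and should be checked against the real POOL schema:

```python
# Sketch of dumping GUID -> PFN pairs to an XML fallback catalogue.
# Element/attribute names approximate a POOL XML catalogue; verify them
# against the real POOL schema before relying on this format.
import xml.etree.ElementTree as ET

def dump_catalogue(rows, path):
    """rows: iterable of (guid, pfn) pairs read from the relational catalogue."""
    root = ET.Element("POOLFILECATALOG")
    for guid, pfn in rows:
        entry = ET.SubElement(root, "File", ID=guid)
        ET.SubElement(entry, "pfn", name=pfn)
    ET.ElementTree(root).write(path)

dump_catalogue([("guid-example", "castor:/castor/cern.ch/grid/file1")],
               "catalogue_fallback.xml")
```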
RLS - Handling Interventions
iAS: can transparently switch to a new box using a DNS alias change (illustrated after this list)
Used for both scheduled and unscheduled interruptions
Web-based tool uses SOAP interface to IT-CS supported solution (not NETOPS)

DB: short interruption to move to stand-by DB
Does not currently use Oracle Data Guard (aka stand-by DB)

Standard Oracle solutions:
iAS clusters
DB clusters (RAC, requires SAN storage)
DataGuard
(Replication)

Evaluating these using openlab fellows for eventual deployment H2 2004(?)
My guess: medium term will use iAS clusters & DataGuard; RAC?
And maybe also replication as temporary inter-site solution?
Longer-term solution for T0 + T1s?
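A toy illustration of why the DNS alias change is transparent to clients (the alias name is invented):

```python
# Clients only ever use the stable alias, so repointing the alias in DNS
# moves the service without any client-side reconfiguration.
import socket

SERVICE_ALIAS = "rls-atlas.cern.ch"  # hypothetical alias maintained by IT-CS

# Each fresh connection resolves the alias again; after the alias is
# repointed, new connections land on the replacement application server.
print(socket.gethostbyname(SERVICE_ALIAS))
```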
Oracle Contract
Previously based on active users, audited via CERN tools
Highly non-standard; obsolete list of products and machines
Maintenance costs were growing at 15% compound

New contract based on named users (standard Oracle license)
Platform independent
Location independent
iAS licenses dramatically increased
Maintenance costs reduced and fixed for 5 years
Extended to all CERN staff + users (HR numbers)
s/w can be installed and run at collaborating institutes for CERN work

Double-edged sword: support issues for outside use are a big concern
Oracle Distribution
Users (sites) must register (group OR) and sign contract addendum
Oracle DB and iAS packaged and redistributed as the basis of the file (metadata) catalog for LCG
Well-defined application, well shielded from users
Defined version of Oracle components, single supported platform (RHEL)
Tools / kits now used within the IT-DB group for CERN services
Also at a few Tier1 sites (CNAF, FZK, RAL, …)

First non-LCG customer: COMPASS
Offload significant(?) fraction of their production / analysis
Requires local Oracle expertise
Bulk data distribution: a few times per year via transportable tablespaces (sketched after this list)

Requests from other groups in queue; proceeding at a rate we can support without impacting CERN production services
Not targeting general-purpose DB services outside (Yet? Ever?)
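For orientation, a hedged sketch of the transportable-tablespace flow mentioned above; the tablespace name, credentials and datafile paths are placeholders, and the exp/imp option spellings should be verified against the installed Oracle release:

```python
# Rough sketch of bulk distribution via transportable tablespaces.
# All names are placeholders; check exp/imp options for your Oracle release.
import subprocess

# 1. At the source, make the tablespace read-only (e.g. in SQL*Plus):
#      ALTER TABLESPACE compass_data READ ONLY;
# 2. Export just its metadata:
subprocess.run(["exp", "userid=system/secret",
                "transport_tablespace=y", "tablespaces=compass_data"],
               check=True)
# 3. Copy the datafiles and the export dump to the receiving site.
# 4. At the destination, plug the tablespace in:
subprocess.run(["imp", "userid=system/secret", "transport_tablespace=y",
                "datafiles=/oradata/compass_data01.dbf"],
               check=True)
```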
Other Oracle Distributions
Client run-time
In principle solved by Oracle 10g instant client (available)
To be tested. Satisfies ATLAS requirements? (Luc Goossens)

Client developer
Could be an additional RPM on top of the above; copy of CERN AFS tree?
To be prepared and tested

Neither of the above work-items is currently scheduled
RLS kit not really suitable just for client usage

Big concern is additional support load
Need to reproduce problems at CERN? Beyond that? (A client-side sketch follows.)
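A minimal sketch of the client run-time case, assuming the Instant Client shared libraries are already on the process's library path (paths and connect string invented):

```python
# Sketch: with the Instant Client libraries on LD_LIBRARY_PATH (set before
# the process starts), an Oracle client such as cx_Oracle can connect with
# no full Oracle installation. Connect string and credentials are placeholders.
import cx_Oracle

# Easy Connect string: host/service, both invented.
conn = cx_Oracle.connect("user", "password", "dbhost.cern.ch/service1")
print(conn.version)  # server version, proving the round trip works
```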
Oracle and the openlab
Oracle is now a sponsor of the CERN openlab

Funding 2 CERN fellows


One focusing on core technology, the other on LCG / physics aspects
Also evaluating features of potential benefit to AIS (and hence the lab)

Direct links to development teams in Europe and US

Program of work driven by service needs


Initially for RLS and CASTOR services
Investigations into Data Guard, Replication and RAC, …
Later 10g and other investigations

Other contributions include OCP training


Increases value of OCP
Helps attract good (short-term) people
Oracle 10g
Latest release of Oracle core technology
Includes Database, Application Server (9.0.4) and Enterprise Manager (now renamed Grid Control)

The "g" stands for Grid
Enterprise Grid, rather than Scientific Grid

What does this really mean?
Emphasis on clusters / farms of blades running Linux
Dynamic reconfiguration of nodes in/out of DB cluster (RAC) / AS cluster

Lots of nice new features (next)
Some requested by CERN
Oracle Database 10g
The self-tuning database (Richard Sarwal embedded)

Support for regular expressions (see sketch after this slide)
Native IEEE floats and doubles
ULDB enhancements: big file tablespaces
Automatic Storage Management
RMAN enhancements, e.g. "wastebasket" (90% of restore requests?)
Instant client install (RPM containing just a couple of shared libraries)
Cross-platform transportable tablespaces
Data Pump

Participated in beta programme (very little effort available), but all relevant features need to be further evaluated
No production plans yet!
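Two of the features listed above can be illustrated from any client; REGEXP_LIKE and BINARY_DOUBLE are standard Oracle 10g SQL, while the table and column names and credentials here are invented:

```python
# Illustration of two Oracle 10g features named above; everything except
# the 10g SQL keywords (REGEXP_LIKE, BINARY_DOUBLE) is scaffolding.
import cx_Oracle

conn = cx_Oracle.connect("user", "secret", "testdb")
cur = conn.cursor()

# Server-side regular expressions (new in 10g):
cur.execute(r"SELECT pfn FROM files WHERE REGEXP_LIKE(pfn, '\.root$')")

# Native IEEE floating point (new in 10g): BINARY_FLOAT / BINARY_DOUBLE.
cur.execute("CREATE TABLE calib (channel NUMBER, gain BINARY_DOUBLE)")
```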
Oracle Application Server 10g
In reality, just iAS 9.0.4 renamed
Continues the trend of becoming an ever more powerful / essential part of the application deployment infrastructure:
(Web) user interfaces, (XML) web services, Oracle Forms

Numerous iAS concerns in recent months; workarounds have been found
Many con-calls and high-level discussions
Thomas Kurian et al.

On paper, AS 10g solves most critical issues and offers awaited functionality for easier / better deployment: clustering, cloning

Oracle Forms: with the web deployment changes (from GUI to application server), a lot of Forms will need to be reviewed and moved to a Forms Application Server

Still an area where more discussion / feedback with Oracle is needed

Oracle Enterprise Manager 10g
Enterprise Manager 9i is used to manage / monitor some of our services

EM10g, aka Grid Control, is a major rewrite of the Oracle management tool (web based, metric oriented, open repository)

Paper evaluation is very promising (possibility to evaluate usage trends, implement SLAs, access by users)
Early access to the software allowed us to provide feedback and start preparing deployment

Should be a key component of our tool-set in managing services
Gradually replacing home-built tools with OEM


Service Outlook
Current services run well, but are costly!
Streamlining / rationalisation will bring many benefits
Use of common strategies, architectures, building blocks
Higher level of service and / or more with same staffing
e.g. many outstanding requests for iAS services
(including for our own infrastructure services, e.g. Forms!)

Too many services today rely on:
Good will
Good luck
This is no basis on which to build the future

Only way to survive with an increasing number of DB and iAS servers, increasing volume of data and increasing complexity of applications is to simplify and standardize
Group-wide Strategy (Vision)

Standardization
Architectures, configurations, processes
Simplifies management, maintenance and trouble-shooting
Reduce Complexity
Avoid diversification
Improve Security / Safety
Protect against unauthorized access, partial data loss, etc.
Test regularly!
Tools
Acquire and / or create tools to stay in control. Be pro-active
Quality Control
Define deliverables together with customer and measure them!
Service Level Agreements (realistic, measurable) for all services
(Re-)implementing Services
Cannot take an existing non-HA service, add an SLA and make it HA

Several techniques available; need to be further studied
openlab et al.

Need to discuss options and impact early on with users
e.g. DataGuard, Replication, RAC all offer solutions in the area of HA
Most appropriate depends on the needs of the service
How much downtime can be tolerated?
Pros & cons of multiple cheap boxes versus more expensive (and more complex) hardware

Learnt a lot from implementing ~24 x 7 services for LCG
Extend to other services as part of group vision (standards!)

Possible H/A DB Solutions
RAC (Real Application Clusters = DB clusters)
Requires special H/W
Automatic application failover (read & write)
More complex to manage than a single box
Notes: best paper solution? (But most expensive.) We are using them! (On Sun with SAN)

DataGuard (= standby DB)
Logical or physical copy of the complete DB
Good for reports on standby
Logical standby has some limitations on schema; physical standby is closed
Notes: widely used in conjunction with RAC

Replication
Finer level of granularity; (a)synchronous
Multi-master (and more complex setups) possible
Notes: performance / management overhead? Stop-gap for distributed RLS? (CMS)

Comments on ATLAS presentation to PEB (Oracle issues only)


ATLAS presentation to PEB

gcc versions

Administrative tools

POOL collections and Oracle



How Oracle Might Fit into an ATLAS Distributed Database


Deployment Strategy
ATLAS slides to PEB
ATLAS would like to see a distributed Oracle deployment model, with the infrastructure and tools and people and commitment to support it
Persistence RTAG report, endorsed by SC2, recommended specific steps in this direction. What steps have been taken?

A SUPPORTED heterogeneous deployment infrastructure would be particularly interesting to ATLAS
Oracle at CERN (or CERN plus selected Tier 1s), MySQL elsewhere?
