7/11/06
1 Introduction
This project is delivering two sets of functionality:
• The BrandZ infrastructure, which enables the creation of zones that provide alternate operating
environment personalities, or brands, on a Solaris(tm) 10 system.
• lx, a brand that supports the execution of Linux applications.
This document describes the design and implementation of both BrandZ and the lx brand.
2 BrandZ Infrastructure
2.1 Zones Integration
It is a goal of this project that all brand management should be performed as simple extensions to the
current zones infrastructure:
• The zones configuration utility will be extended to configure zones of particular brands.
• The zone administration tools will be extended to display the brand of any created zone, as well as
to list the brand types available.
BrandZ adds functionality to the zones infrastructure by allowing zones to be assigned a brand type. This
type is used to determine which scripts are executed when a zone is installed and booted. In addition, a
zone's brand is used to properly identify the correct application type at application launch time. All zones
have an associated brand for configuration purposes; the default is the 'native' brand.
2.1.1 Configuration
The zonecfg(1M) utility and libzonecfg have been modified to be brand-aware by adding a new
“brand” attribute, which may be assigned to any zone. This brand may be assigned explicitly:
# zonecfg -z newzone
newzone: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:newzone> create -b
zonecfg:newzone> set brand=lx
A zone's brand, like its zonepath, may only be changed prior to installing the zone.
The rest of the user interface for the zone configuration process remains the same. To support this, each
brand delivers a configuration file at /usr/lib/brand/<name>/config.xml. The file describes
how to install, boot, clone (and so on) the zone. It also identifies the name of the kernel module (if any)
that provides kernel-level functionality for the brand, and lists the privileges that are, or may be, assigned
to the zone. A sample of the file follows:
<!DOCTYPE brand PUBLIC "-//Sun Microsystems Inc//DTD Brands//EN"
"file:///usr/share/lib/xml/dtd/brand.dtd.1">
<brand name="lx">
<modname>lx_brand</modname>
<initname>/sbin/init</initname>
<install>/usr/lib/brand/lx/lx_install %z %R %*</install>
<boot>/usr/lib/brand/lx/lx_boot %R %z</boot>
<halt>/usr/lib/brand/lx/lx_halt %z</halt>
<verify_cfg></verify_cfg>
<verify_adm></verify_adm>
<postclone></postclone>
</brand>
2.1.2 Installation
The current zone install mechanism is hardwired to execute lucreatezone to install a zone. To support
arbitrary installation methods, the config.xml file contains a line of the form:
<install>/usr/lib/lu/lucreatezone -z %z</install>
...
The brand module is automatically loaded by the kernel the first time a branded zone is booted.
Each brand must declare a struct brand as part of the registration process.
typedef struct brand {
int b_version;
char *b_name;
struct brand_ops *b_ops;
struct brand_mach_ops *b_machops;
} brand_t;
This structure declares the name of the brand, the version of the brand interface against which it was
compiled, and ops vectors containing both common and platform-dependent functionality. The
b_version field is currently used to determine whether a brand is eligible to be loaded into the system. If
the version number does not match the one compiled into the kernel, then we simply refuse to load the
brand module. In theory, this version number could also be used to interpret the contents of the two ops
vectors, allowing us to continue supporting older brand modules.
It is important to note that even though we have defined a linkage type for brands and have implemented a
versioning mechanism, we are not defining a formal ABI for brands. The relationship between brands and
the kernel is so intimate that we cannot hope to properly support the development of brands outside the ON
consolidation. This does not mean that we will do anything to prevent the development of brands outside of
ON, but we must minimize the possibility of an out-of-date brand accidentally damaging something within
the kernel.
The first 5 entries of this vector allow a brand to override the standard system call paths with their own
interpretations. The final entry protects Solaris from brands that make different use of the segment registers
in userspace, and vice-versa.
The SPARC-specific ops vector is:
struct brand_mach_ops {
void (*b_syscall)(void);
void (*b_syscall32)(void);
};
Of these routines, only the int80 entry of the x86 vector is needed for the initial lx brand. The other
entries are included for completeness and are only used by a trivial 'Solaris N-1' brand used for basic
testing of the BrandZ framework on systems without any Linux software.
These ops vectors are sufficient for the initial Linux brand, but supporting a whole new operating system
such as FreeBSD or Apple's OS X would almost certainly require modifying this interface.
3 The lx Brand
3.1 Brands and Distributions
The lx Brand is intended to emulate the kernel interfaces expected by the Red Hat Enterprise Linux 3
userspace. The freely available CentOS 3 distribution is built from the same SRPMs as RHEL, so it is
expected to work as well.
The interface between the kernel and userspace is largely encapsulated by the version of glibc that ships
with the distribution. As such, the interface we emulate will likely support other distributions that make use
of glibc 2.3.2. For example, while we will not be supporting the Debian distribution officially, Debian 3.1
uses this version of glibc and has been shown to work with the lx brand.
It should be noted that supporting a new distribution will always require a new install script. For RPM-
based distributions, it might be sufficient to update the existing scripts with new package lists. Adding
support for distributions such as Debian, which do not use the RPM packaging format, will require entirely
new install scripts to be created. It is relatively simple to have a variety of different install scripts within a
single brand, so simply changing the packaging format does not require the creation of a new brand.
Since all the Red Hat and CentOS versions we support include the same kernel version, the uname(2)
information reported to processes in a Linux zone is hardcoded into the lx brand.
kingston$ uname -a
Linux kingston 2.4.21 BrandZ fake linux i686 i686 i386 GNU/Linux
kingston$ more /proc/version
Linux version 2.4.21 (Sun C version 5.7.0) #BrandZ fake linux SMP
If we were to add support for other distributions, this information would have to be made zone-specific.
3.2 Installation of a Linux zone
Most Linux distributions' native installation procedures start by probing and configuring the hardware on
the system and partitioning the hard drives. The user then selects which packages to install and sets
configuration variables such as the hostname and IP address. Finally the installer lays out the filesystems,
installs the packages, modifies some configuration files, and reboots.
When installing a Linux zone, we can obviously skip the device probing and disk partitioning. For the
remainder of the installation procedure, there are several different approaches we could take.
• One approach is to execute a distribution's installation tool directly. Most of them are based on shell
scripts, Python, or some other interpreted language, so we could theoretically run the tools before
having a Linux environment running. This approach relies on the existing install tools being fairly
robust and would be hard to sustain between releases if the tools change.
• Another option is to develop our own package installation tool, which extracts the desired software
from a standard set of distribution media and copies it into the zone.
• A third option would be to adopt a flasharchive-like approach, in which we would simply unpack a
prebuilt tarball or cpio image into the target filesystem. This image could be built by us, or by a
customer that already had an installed Linux system. If we were to build this image ourselves, this
approach would give us complete control over the final image and allow us to manually handle
any particularly ugly early-stage installation issues. This would also make it trivial for a customer to
"just try it out," an approach that has been successful for VMware, Usermode Linux, and QEMU.
It is our intention to support the last two options.
We will provide an installation script that will extract a known set of RPMs from a known set of Red Hat or
CentOS installation media. We will also allow a zone to be installed from a user-specified tarball. We will
document how a user can construct an installable tarball that can be used for flasharchive-like installations.
We will provide a “silent” installation mode as well, allowing for non-interactive installs.
You cannot run Solaris binaries inside a Linux zone. Any attempt to do so will fail with “No such
file or directory”, as it does on a real Linux system.
3.3.2 Running Linux Binaries
Rather than executing the Linux application directly, the exec handler starts the program execution in the
brand support library. The library then runs whatever initialization code it needs before starting the Linux
application.
For its initialization, the lx brand library uses the brandsys(2) system call to pass the following data
structure to the kernel:
typedef struct lx_brand_registration {
uint_t lxbr_version; /* version number */
void *lxbr_handler; /* base address of handler */
void *lxbr_traceflag; /* address of trace flag */
} lx_brand_registration_t;
This structure contains a version number, ensuring that the brand library and the brand kernel module are
compatible with one another. It also contains two addresses in the application's address space: the starting
address of the system call emulation code and the address of a flag indicating whether the process is being
traced. The uses of these two addresses are discussed in sections 3.4 and 3.9, respectively.
Once the support library has finished initialization, it constructs a new aux vector for the Linux interpreter
to run, and jumps to the interpreter's entry point. The brand's exec handler replaces the standard aux vector
entries with the corresponding values above, clears the above vectors (by setting their type to
AT_IGNORE), resets the stack to its pre-Solaris-linker state, and then jumps to the non-native interpreter
which then runs the executable as it would on its own native system.
The advantages of this design are:
• No modifications to the Solaris runtime linker are necessary.
• Virtually all knowledge of branding is isolated to the kernel brand module and userland brand
support library.
• It keeps us aligned with the de-facto standard for non-native emulation established with SunOS 4
BCP.
When debugging an application with mdb, any attempt to step into or over a system call will leave the user
in the BrandZ emulation code.
Each Linux system call falls into one of three categories: pass-through, simple emulation, and
complex emulation.
3.4.1 Pass-Through
A pass-through call is one that requires no data transformation and for which the Solaris 10 semantics
match those of the Linux system call. These can be implemented in userland by immediately calling the
equivalent system call.
For example:
int
lx_read(int fd, void *buf, size_t bytes)
{
	int rval;

	rval = read(fd, buf, bytes);
	return (rval < 0 ? -errno : rval);
}
Other examples of pass-through calls are close(), write(), mkdir(), and munmap().
Although the arguments to the system call are identical, the method for returning an error to the caller
differs between Solaris and Linux. In Solaris, the system call returns -1 and the error number is stored in the
thread-specific variable errno. In Linux, the error number is returned as a negative value in the system
call's return value.
There are also differences in the error numbers between Solaris and Linux. The lx_read() routine is
called by lx_emulate(), which handles the translation between Linux and Solaris error codes for all
system calls.
Other examples of simple emulated calls are stat(), mlock(), and getdents().
Flag Supported?
CLONE_VM Yes
CLONE_FS Yes
CLONE_FILES Yes
CLONE_SIGHAND Yes
CLONE_PID Yes
CLONE_PTRACE Yes
CLONE_PARENT Partial. Not supported for fork()-style clone() operations.
CLONE_THREAD Yes
CLONE_SYSVSEM Yes
CLONE_SETTLS Yes
CLONE_PARENT_SETTID Yes
CLONE_CHILD_CLEARTID Yes
CLONE_DETACH Yes
CLONE_CHILD_SETTID Yes
When an application uses clone(2) to fork a new process, the lx_clone() routine simply calls
fork1(2). When an application uses clone(2) to create a new thread, we call the
thr_create(3C) routine in the Solaris libc.
The Linux application provides the address of a function at which the new thread should begin executing as
an argument to the system call. However, the Linux kernel does not actually start execution at that address.
Instead, the kernel essentially does a fork(2) of a new thread which, like a forked process, starts with
exactly the same state as the parent thread. As a result, the new thread starts executing in the middle of the
clone(2) system call, and it is the glibc wrapper that causes it to jump to the user-specified address.
This Linux implementation detail means that when we call thr_create(3C) to create our new thread,
we cannot provide the user's start address to that routine. Instead, all new Linux threads begin by executing
a routine that we provide, called clone_start(). This routine does some final initialization, notifies the
brand's kernel module that we have created a new Linux thread, and then returns to glibc.
A by-product of the threads implementation in Linux is that every thread has a unique PID. To mimic this
behavior in the lx brand, every thread created by a Linux binary reserves a PID from the PID list. This
reservation is performed as part of the clone_start() routine. (Note: to prevent a pidreserve bomb
from crippling the system, the zone.max-lwps rctl may be used to limit the number of PIDs
allocated by a zone.)
This reserved PID is never seen by Solaris processes, but it is used by Linux processes. When a Linux
thread calls getpid(2), it is returned the standard Solaris PID of the process. When it calls gettid(2), it
is returned the PID that was reserved at thread creation time. Similarly, kill(2) sends a signal to the
entire process represented by the supplied PID, while tkill(2) sends a signal to the specific thread
represented by the supplied PID.
The Linux thread model supported by modern Red Hat systems is provided by the Native POSIX Thread
Library (NPTL). NPTL uses three consecutive descriptor entries in the Global Descriptor Table (GDT) to
manage thread-local storage. One of the arguments to clone() is an optional descriptor entry for TLS.
More commonly used is the set_thread_area() system call, which takes a descriptor as an argument
and returns the entry number in the GDT in which it has been stored. NPTL then uses this to initialize
the %gs register. The descriptors are per-thread, so they have to be stored in per-thread storage and the GDT
entries must be re-initialized on context switch. This is done via a restore ctx operation.
Since both NPTL and the Solaris libc rely on %gs to access per-thread data, we have added code to
virtualize its usage. The first thing our user-space emulation library does is:
/*
* Save the Linux libc's %gs and switch to the Solaris libc's %gs
* segment so we have access to the Solaris errno, etc.
*/
pushl %gs
pushl $LWPGS_SEL
popl %gs
This sequence ensures that we always enter our Solaris code using the well-known value used for our %gs.
We also stash the current value of %gs on the stack, so we can restore it prior to returning to Linux code.
3.5.2 EFAULT/SIGSEGV
If the user-space emulation library were to access an argument from a system call which had an invalid
address, a SIGSEGV signal would be generated. For proper Linux emulation, the desired result in this
situation is to generate an error return from the system call with an EFAULT errno.
To deliver the expected behavior, we will introduce a new system call (uucopy()), which copies data
from one user address to another. Any attempt to use an illegal address will cause the call to return an error.
Otherwise, the data will be copied as if we had performed a standard bcopy() operation.
For example:
int
lx_system_call(int *arg)
{
int local_arg;
int rval;
/*
* catch EFAULT
*/
if ((rval = uucopy(arg, &local_arg, sizeof (int))) < 0)
return (rval); /* errno is set to EFAULT */
/*
* transform the arg, now in local_arg, to Solaris format
*/
return (solaris_system_call(&local_arg));
}
This functionality seems to be generically useful, so the uucopy() call will be implemented in libc,
where it will be available to any application.
If the overhead imposed by this system call dramatically limits performance, we may include an
environment variable that causes the brand library to perform a standard userspace copy rather than the
kernel-based copy. Setting this variable would lead to higher performance, but some system calls would
segfault rather than returning EFAULT.
3.5.4 /etc/mnttab
Linux keeps the current filesystem mount state in /etc/mtab. This file is a plain text file and its contents
are maintained by the mount command. Applications trying to determine what mounts currently exist on
the system normally access this file via the setmntent(3) call. Linux also exports the current system mount
state via /proc/mounts, but most applications don't access this file. For applications that do attempt to
access it, the file is emulated in lx_procfs.
3.5.5 ucontext_t
The Linux and Solaris ucontext_t structures are slightly different. The most significant differences
are that the Linux ucontext_t contains a signal mask and a copy of the x86 %cr2 register. The
signal mask is maintained by glibc, not the Linux kernel, so there is no extra work required of us. When
we deliver a SIGSEGV, we fill in the %cr2 field using information available in the siginfo_t.
3.6 Signal Handling
Delivering signals to a Linux process is complicated by differences in signal numbering, stack structure and
contents, and the action taken when a signal handler exits. In addition, many signal-related structures, such
as sigset_ts, vary between Solaris and Linux.
The simplest transformation that must be done when sending signals is to translate between Linux and
Solaris signal numbers.
Major signal number differences between Linux and Solaris
Number Linux Solaris
10 SIGUSR1 SIGBUS
12 SIGUSR2 SIGSYS
16 SIGSTKFLT SIGUSR1
17 SIGCHLD SIGUSR2
18 SIGCONT SIGCHLD
19 SIGSTOP SIGPWR
20 SIGTSTP SIGWINCH
21 SIGTTIN SIGURG
22 SIGTTOU SIGPOLL
23 SIGURG SIGSTOP
24 SIGXCPU SIGTSTP
25 SIGXFSZ SIGCONT
26 SIGVTALRM SIGTTIN
27 SIGPROF SIGTTOU
28 SIGWINCH SIGVTALRM
29 SIGPOLL SIGPROF
30 SIGPWR SIGXCPU
31 SIGSYS SIGXFSZ
When a Linux process sends a signal using the kill(2) system call, we translate the signal into the
Solaris equivalent before handing control off to the standard signalling mechanism. When a signal is
delivered to a Linux process, we translate the signal number from Solaris back to Linux. Translating signals
both at generation and at delivery time ensures both that Solaris signals are sent properly to Linux
applications and that signals' default behavior works as expected.
One issue is that Linux supports 32 realtime signals, ranging from 32 (SIGRTMIN) to 63 (SIGRTMAX).
(SIGRTMIN is "at or near" 32 because glibc usually "steals" one or more of these signals for its own
internal use, adjusting SIGRTMIN and SIGRTMAX as needed.) Conversely, Solaris actively uses signals
32-40 for other purposes and only supports 8 realtime signals, in the range 41 (SIGRTMIN) to 48
(SIGRTMAX).
At present, attempting to translate a Linux signal greater than 39 (corresponding to the maximum real time
signal number Solaris can support) will generate an error. We have not yet found an application that
attempts to send such a signal.
Branded processes are set up to ignore any Solaris signal for which there is no direct Linux analog,
preventing the delivery of untranslatable signals from the global zone.
For BrandZ Linux threads, the signal delivery path instead looks like this:
kernel ->
lx_sigacthandler() ->
sigacthandler() ->
call_user_handler() ->
lx_call_user_handler() ->
Linux user signal handler
The setsigacthandler() routine works by replacing the address of libc's own interposition handler,
which libc already keeps in a per-thread data structure, with the address passed in; the old handler's
address is stored in the pointer referenced by the second argument, if it is non-NULL, mimicking the
behavior of sigaction() itself. Once setsigacthandler() has been executed, all future branded
threads this thread may create will automatically have the proper interposition handler installed as the
result of a normal sigaction() call.
Note that none of this interposition is necessary unless a Linux thread registers a user signal handler,
because the default action for all signals is the same between Solaris and Linux save for one signal,
SIGPWR. For this reason, BrandZ always installs its own internal signal handler for SIGPWR that translates
the action to the Linux default, to terminate the process. (Solaris' default action is to ignore SIGPWR.)
It is also important to note that when signals are not translated, BrandZ relies upon code interposing upon
the wait(2) system call to translate signals to their proper values for any Linux threads retrieving the
status of others. So, while the Solaris signal number for a particular signal is set in the data structures for a
process, the BrandZ interposition upon wait(2) is responsible for translating the value WTERMSIG()
would return from a Solaris signal number to the appropriate Linux value.
such that when the Linux user signal handler is eventually called, the stack looks like this:
Pointer to sigreturn trampoline code
Linux signal number
Pointer to Linux siginfo_t
Pointer to Linux ucontext_t
Linux ucontext_t
Linux fpstate
Linux siginfo_t
BrandZ takes the approach of intercepting the Linux sigreturn(2) (or rt_sigreturn(2)) system
call in order to turn it into a return through the libc call stack that Solaris expects. This is done by the
lx_sigreturn() or lx_rt_sigreturn() routines, which remove the Linux signal frame from the
stack and pass the resulting stack pointer to another routine, lx_sigreturn_tolibc(), which makes
libc believe the user signal handler it had called returned.
When control then returns to libc's call_user_handler() routine, a setcontext(2) will be done
that (in most cases) returns the thread executing the code back to the location originally interrupted by
receipt of the signal.
One final complication in this process is the restoration of the %gs segment register when returning from a
user signal handler. Prior to BrandZ, Solaris' libc forced the value of %gs to a known value when calling
setcontext() to return to an interrupted thread from a user signal handler. (Since libc uses %gs
internally as a pointer to curthread, this ensures a good "known value" for curthread.)
Since BrandZ requires that setcontext() restore a Linux value for %gs when returning from a Linux signal
handler, we made this forced restoration optional on a per-process basis. This was accomplished by means
of a new private routine to libc:
void set_setcontext_enforcement(int on)
By default, the "curthread pointer" value enforcement is enabled. When this routine is called with an
argument of '0', the mechanism is disabled for this process.
Shutting off this mechanism does not have any correctness or security implications. Writing to the %gs
segment register is not a privileged operation, so %gs can be set to any value at any time by user
code. The only drawback to disabling the mechanism is that if a bad value is set for %gs, the broken
application will likely suffer a segmentation fault deep within libc.
Networking Devices
Native Solaris non-global zones have a network interface that is visible (reported via ifconfig), but there are
no actual network device nodes accessible via /dev. Certain higher level network protocol devices are
accessible in native zones:
Looking at a native Linux 2.4 system, we see that the following network devices exist:
/dev/inet/egp, /dev/inet/ggp, /dev/inet/icmp, /dev/inet/idp
/dev/inet/ip, /dev/inet/ipip, /dev/inet/pup, /dev/inet/rawip
/dev/inet/tcp, /dev/inet/udp
[1]ptm is a clone device, so this translation is tricky. Basically, the /dev/ptmx node in a
native zone points to the clone device, but when an open is done on this device, the vnode
that is returned corresponds to a ptm node (and not a clone node). This means that on a
Solaris system, a stat of /dev/ptmx will return different dev_t values than an fstat(2)
of an fd that was created by opening /dev/ptmx. On Linux, both of these operations
need to return the same result. So once again, we are mapping multiple major/minor Solaris
device numbers to a single Linux device number.
[2]For pts devices, there will be no translation done for device minor node numbers.
Linux device numbers are currently hard coded into dev_t translators. It was suggested that we move this
mapping into an external XML file, to simplify the task of adding a new translator. This change would not
significantly reduce the effort required, since the dev_t translators are only a small portion of the updates
required to support a new device in a Linux zone. In general, to add a new device to a Linux zone the
following things need to be done:
- dev_t translators need to be added for stat(2) calls.
- ioctl translators need to be added for any ioctls supported on the devices.
- mappings need to be added for the devices to rename them from standard Solaris names into
Linux device names.
Since device names and semantics are different between Solaris and Linux, we will not support
adding devices to Linux zones via the zonecfg(1M) device resource. (That is, zonecfg(1M) will prevent
an administrator from adding "device" resources to a Linux zone.)
For the most part, applications are not likely to be sensitive to these device numbering issues. Searching
Linux distro source rpms for references to st_rdev reveals many hardcoded device number
references, but most of these references are in code that we would not expect to support in a Linux
zone: utilities to manage cds, dvds, raid arrays, and so on.
3) Socket ioctls:
FIOGETOWN, SIOCSPGRP, SIOCGPGRP, SIOCATMARK,
SIOCGIFFLAGS, SIOCGIFADDR, SIOCGIFDSTADDR,
SIOCGIFBRDADDR, SIOCGIFNETMASK, SIOCGIFMETRIC,
SIOCGIFMTU, SIOCGIFCONF, SIOCGIFNUM
Most of these ioctls are streams ioctls, and since FIFOs and sockets are implemented via streams in Solaris,
any FIFO or socket supports most of these ioctls. Of the 45 ioctls listed above, only 8 are actually device-
specific ioctls.
This indicates that doing ioctl translations via layered drivers is not the best approach, since this would only
address a minor subset of the total ioctls that need to be supported. Because supporting non-device ioctls
will require the creation of a non-layered driver ioctl translation mechanism, it seems more appropriate to
handle device ioctls via this same mechanism as well.
With this in mind, it is more instructive to recast the categories above in terms of their vnode
v_type and v_rdev values. If we do this, we get:
1) VREG, VFIFO, VSOCK, VCHR[ptm, pts, sy, zcons]
2) VFIFO, VSOCK, VCHR[ptm, pts, sy, zcons]
3) VSOCK
4) VCHR[ptm]
5, 6, 7, 8) VCHR[pts]
Supporting ioctls on these vnodes will require a switch table. In addition to the ioctl number, the translation
mechanism must look at the type of the file descriptor an ioctl is targeting to determine what translation
needs to be done. Hence, the translation layer will need to look at the v_type and the major portion of
v_rdev associated with the target file descriptor. These fields are easily accessible from the kernel and are
also available via st_mode and st_rdev from fstat(2), so this translation could occur either in the kernel or
in userland.
One tricky part about this mapping is that we don't want to hardcode the major Solaris driver number
into any translation code, since these numbers are allocated dynamically via /etc/name_to_major in
the global zone. Therefore, device ioctl translators will be bound to specific Solaris drivers by their driver
name. When an application attempts to perform an ioctl to a driver, the translation code will resolve the
driver name to driver major number mapping. This translation code will not have any impact on how
devices are managed in the global zone.
Device Notes
------ -----
/dev/null read-write, doesn't have any consumer state
/dev/zero read-write, doesn't have any consumer state
The other important aspect of device paths is how brand/zone-specific device paths are mapped into
branded zones. Here is an example of some of the global brand/zone-specific device paths and their path
mappings as seen from within Linux branded zones.
Global zone device path Linux zone device path
----------------------- ----------------------
/dev/zcons/<zone_name>/zoneconsole /dev/console
/dev/brand/<brand_name>/ptmx /dev/ptmx
/dev/brand/<brand_name>/dsp /dev/dsp
/dev/brand/<brand_name>/mixer /dev/mixer
3.7.5.3 /dev/fd/*
The entries in /dev/fd are not actually devices. The entries in /dev/fd/ allow a process access to its
own file descriptors via another namespace. Thus, opens of entries in this directory map to re-opens of the
corresponding file descriptor in the current process.
In Solaris /dev/fd is implemented via a filesystem. readdir(3C)s of /dev/fd might not return an
accurate reflection of the current file descriptor state of a process, but opens of specific entries in the
directory will succeed if that file descriptor is valid for the process performing the open.
In Linux, /dev/fd is implemented as a symbolic link to /proc/self/fd. This /proc filesystem
directory is similar to the Solaris /proc/<pid>/fd directory in that it contains an accurate
representation of a process's current file descriptor state. But aside from just providing access to the
process's current file descriptors, on Linux the files in this directory are actually symbolic links to the
underlying files referenced by the process's file descriptors. This is similar to the functionality in Solaris
provided by /proc/<pid>/path.
The most common uses for /dev/fd entries are for setuid shell scripts and as parameters to commands that
don't natively support I/O to stdin/stdout. Given these use cases it seems that a simple mount of the
existing Solaris /dev/fd filesystem in the Linux zone should be sufficient for compatibility purposes.
OSS or ALSA:
xmms (selectable via plugins)
Our survey identified no popular applications that require ALSA, so we will only be supporting OSS audio.
Audio device access on Linux and Solaris is done via reads, writes, and ioctls to different devices.
OSS devices:
/dev/dsp, /dev/mixer
/dev/dsp[0-9]+, /dev/mixer[0-9]+
Solaris devices:
/dev/audio, /dev/audioctl
/dev/sound/[0-9]+, /dev/sound/[0-9]+ctl
Unfortunately, we can't simply map the Solaris /dev/audio and /dev/audioctl devices to the
/dev/dsp and /dev/mixer devices in a Linux zone and expect the ioctl translator to handle everything
else for us. Some of the reasons for this are:
• The admin/user may not always want a Linux branded zone to have access to system audio devices.
• There may be multiple audio devices on a system, each of which may support only input, only
output, or both input and output. In Solaris a user can specify which audio device an application
should access by providing a /dev/sound/* path to the desired device. But in the Linux zone
the admin might want the Linux audio device to map to separate Solaris audio devices for input
and/or output.
• Linux ioctl translation is done using dev_t major values. On Solaris, opening /dev/audio will
result in accessing different device drivers based on what the underlying audio hardware is, and
these different drivers may have different dev_t values. Hence, if audio devices were directly
imported, the dev_t translator would need to have knowledge of every potential audio device driver
on the system, and as new audio drivers are added to the system this translator would need to be
made aware of them as well.
• In Linux audio devices are character devices and support mmap operations. On Solaris audio
devices are STREAMS-based and do not support mmap operations.
To deal with these problems the following components are provided:
• A way for the user to enable audio support in a zone via zonecfg.
The user enables audio via a zonecfg boolean attribute called "audio". (The absence of this
attribute implies a value of false.) Adding this attribute to a zone via zonecfg looks like this:
--
zonecfg:centos> add attr
zonecfg:centos:attr> set name="audio"
zonecfg:centos:attr> set type=boolean
zonecfg:centos:attr> set value=true
zonecfg:centos:attr> end
zonecfg:centos> commit
zonecfg:centos> exit
--
3.7.6 Networking
All network plumbing is done by the global zone, just as network plumbing is done today for native
zones. The Linux zone administrator should not configure the Linux zone to attempt to plumb network
interfaces. Any attempt to plumb network interfaces or change network interface settings from within the
Linux zone will fail. In most cases the failure will manifest as an ioctl(2) operation failing with
ENOTSUP.
Any networking tools that utilize normal TCP/UDP socket connections (e.g., ftp, telnet, ssh, etc.)
should work. Any tools that require raw socket access (e.g., traceroute, nmap) will fail with
ENOTSUP. Utilities that query network interface properties (e.g., ifconfig) will work, although
attempts to change network configuration will fail.
On Solaris, lockd is a userland daemon with a significant kernel component: klmmod. Most of the
lockd functionality is actually implemented in the kernel in klmmod. The kernel lockd component also
uses a private, undocumented interface (NSM_ADDR_PROGRAM) to communicate with statd.
On Linux lockd is actually entirely contained within the kernel. When the kernel starts up the lockd
services, it creates a fake process that is visible via the ps command but lacks most normal /proc style
entries.
Given how closely integrated the separate components of the NFS client are on Solaris, and given that
most of the NFS client on Linux resides in the kernel and is therefore not usable by the lx brand, the
approach taken to support the NFS client in the lx brand was to simply run the Solaris NFS client within
the lx zone. Adding support for running all of the Solaris NFS client components in a zone involved
modifications to BrandZ, the lx brand, and base Solaris. Some of these areas and the modifications that
were required are described below.
3.8.1.3.2 Chroot
One problem with running native Solaris binaries in a branded zone is that both the native binaries and the
native libraries that they use expect to be able to access native Solaris paths and files that may not exist
inside a branded zone. Rather than implementing a path mapping mechanism to redirect filesystem accesses
for native binaries to paths under /native, during the startup of these daemons we do a chroot("/native").
We've also ensured that there is enough of the native Solaris environment created in /native to allow
lockd and statd to run properly.
3.8.1.4 Allowing lockd and statd to communicate with Linux services/interfaces within the zone
lockd and statd are fairly self-contained, but they do require access to certain services for which the
native Solaris versions won't be available in a zone. An audit of lockd and statd reveals that these
daemons depend on access to the following services:
• naming services (via libnsl.so)
• syslog (via libc.so)
Normally, these daemons simply access these services via local libraries. These libraries in turn use local
files, other libraries, and various network based resources to resolve requests. In a branded zone most of
these resources will not be available. For example, we can't expect the Solaris libnsl.so library to know how
to parse Linux NIS configuration files.
To handle these requests we need to be able to leverage existing Linux services and interfaces. This requires
translating certain Solaris lockd and statd service requests into Linux service requests, and then
translating any results back into the format that Solaris libraries and utilities expect. In the lx brand
we've decided to call this process of translating service requests "thunking" (akin to a 32-bit OS calling
into 16-bit BIOS code). To service these requests we have created a thunking layer which translates Solaris
calls into Linux calls.
This thunking layer works as follows:
1. When lockd or statd makes a request that requires thunking, the request ends up getting
directed into a library in the process called lx_thunk.so (the mechanism used to direct requests into
this library varies based on the type of request being serviced and is discussed further below).
2. The lx_thunk.so library packs up the request and sends it via a door to a child Linux process called
lx_thunk.
3. If the lx_thunk process does not exist, the lx_thunk.so library will fork(2)/exec(2) it.
4. The lx_thunk process is a one-line /bin/sh script that attempts to execute itself and is executed in a
Linux shell. When the brand emulation library (lx_brand.so) detects that it is executing as the
lx_thunk process and is attempting to re-exec itself, the library takes over the process and sets
itself up as a doors server.
5. When the lx_thunk process receives a door request from the lx_thunk.so library in a native process, it
unpacks the request and uses a Linux thread to invoke Linux interfaces to service the request.
6. Once it is done servicing the request, it packs up any results and returns them via the same door call
on which it received the request.
This thunking layer means that the lx brand is now dependent upon Linux interfaces, so we need to worry
about Linux interfaces changing and breaking the lx_thunk server process. To help avoid this possibility,
most of the Linux interfaces that we've chosen to use are extremely well known and listed in the glibc ABI.
All of these interfaces are used by many applications outside of glibc. Here are the Linux interfaces
currently used by the lx_thunk process:
• gethostbyname_r
• gethostbyaddr_r
• getservbyname_r
• getservbyport_r
• openlog
• syslog
• closelog
• __progname
Also worth mentioning is the means by which service requests that require thunking are directed to
lx_thunk.so. To intercept name service requests, the lx brand introduces a new libnsl.so name-to-address
translation plugin library. libnsl.so already supports name-to-address translation plugin libraries, which
can be specified via netconfig(4). For lx branded zones there will be a custom netconfig file installed at
/native/etc/netconfig that will instruct libnsl.so to redirect name service lookup requests to a new library
called lx_nametoaddr.so. This library will in turn resolve name service requests using private interfaces
exported from the thunking library, lx_thunk.so.
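A netconfig(4) entry of the kind described might look like the fragment below. The field values, and in particular the library path in the last field, are illustrative assumptions, not the shipped file; the point is that the final field names the name-to-address translation library libnsl.so will load.

```
# net_id  semantics     flags  family  proto  device    nametoaddr_libs
tcp       tpi_cots_ord  v      inet    tcp    /dev/tcp  /native/usr/lib/lx_nametoaddr.so
```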
3.8.2 Automounters
Linux supports two automounters: "amd" and "automount".
amd is implemented as a userland NFS server. It mounts NFS filesystems on the directories where it will
provide automount services, and specifies itself as the server for these NFS filesystems. Supporting amd
required only adding translation support for all the Linux mount(2) system call options it expects to work.
automount, the more common (and often default) automounter, is substantially more complex than amd.
automount relies on a filesystem module called autofs. Upon startup, automount mounts the autofs
filesystem onto all automount-controlled directories. As an option to the mount command it passes the file
descriptor of a pipe that will be used to send requests to the automount process. automount listens for
requests on this pipe. When it gets a request, it looks up shares via whatever name services are configured,
executes mount(2) system calls as necessary, and notifies the autofs filesystem that the request has been
serviced. The exact semantics of the interfaces between automount and autofs are versioned and appear to
differ based on the Linux kernel version. To support automount the lx brand will introduce a new filesystem
module called lx_afs. When the automount process attempts to mount the autofs filesystem we will
instead mount the lx_afs filesystem, which will emulate the behavior of one specific version of the autofs
filesystem.
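The mount described above might look like the illustrative command line below (the descriptor number, process group, and directory are invented for the example): fd= names the pipe over which the kernel delivers requests to the daemon, and minproto=/maxproto= advertise the versioned kernel interface whose variation across kernels is the reason lx_afs emulates one specific version.

```
mount -t autofs -o fd=4,pgrp=1234,minproto=2,maxproto=4 automount /home
```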
3.10 /proc
The lx brand will deliver an lx_proc kernel module that provides the necessary semantics of a Linux /proc
filesystem.
Linux tends to use /proc as a dumping ground for all things system-related, although this is reduced by the
introduction of sysfs in the 2.6 kernel. Thus, we will not be able to emulate a large number of elements from
within a zone. Examples of unsupported functionality include physical device characteristics, the USB
device tree, access to kernel memory, etc. Because various commands expect these files to be present but
do not actually act on their contents, a number of these files will exist but otherwise be empty.
We are able to emulate the per-process directories completely. The following table shows the support status
of other /proc system files.
4 Deliverables
4.1 Source delivered into ON
Below is a summary of the new sources being added to ON.