
Optimal Control of Storage Regeneration with Repair Codes

Francesco De Pellegrini, Rachid El Azouzi, Alonso Silva and Olfa Hassani

Fondazione Bruno Kessler, via Sommarive, 18 I-38123 Povo, Trento, Italy; CERI/LIA, University of Avignon, 339, Chemin des Meinajaries, Avignon, France; Nokia Bell Labs, Paris-Saclay, France. This research was performed while the first author was visiting Nokia Bell Labs.

arXiv:1711.03034v1 [cs.IT] 8 Nov 2017

Abstract: High availability of containerized applications requires robust storage of the applications' state. Since basic replication techniques are extremely costly at scale, storage space requirements can be reduced by means of erasure and/or repairing codes.
In this paper we address storage regeneration using repair codes, a robust distributed storage technique with no need to fully restore the whole state in case of failure. In fact, only the content of the lost servers is replaced. To do so, new clean-slate storage units are made operational at a cost for activating new storage servers and a cost for the transfer of repair data.
Our goal is to guarantee maximal availability of the containers' state files by a given deadline. Upon a fault occurring at a subset of the storage servers, we aim at ensuring that they are repaired by a given deadline. We introduce a controlled fluid model and derive the optimal activation policy to replace servers under such correlated faults. The solution concept is the optimal control of regeneration via the Pontryagin minimum principle. We characterize feasibility conditions and we prove that the optimal policy is of threshold type. Numerical results describe how to apply the model for system dimensioning and show the tradeoff between activation of servers and communication cost.

Index Terms: high availability, containers, regeneration, repair codes, optimal control

I. INTRODUCTION

Container technology has quickly become the most promising cloud virtualization technique, since it is lightweight and portable to different hardware. The uptake of containerization has been so fast that containers have become the only runnable entities supported by Google's infrastructure [1]. The main difference of containers with respect to traditional virtual machines is that they are executed in the application space of a server. In fact, container deployment does not require the instantiation of a full operating system on top of the one running on the host server, thus representing a lighter solution with faster setup time.

However, high availability of containerized applications is still a developing concept, e.g., building blocks such as failure detection and failover management are missing [2]. Virtual machines and containers, in turn, may be supported by availability guarantees [3] corresponding to specific service level agreements (SLA) to remain continuously functional (staying operational 99.999% of the time is called the "five nines" rule [4]).

High availability requires a large degree of fault tolerance, both at the software and the hardware level. In the case of containerized applications, whenever a container fails, such failure can be masked, while the related traffic and tasks are redirected to healthy replicas. Incidentally, this is also the standard technique for seamless migration of containerized applications across cloud servers for load-balancing purposes.

Cloud native applications to be containerized are ideally instantiated in a stateless fashion. This makes it simple to render container execution highly available. However, containerized applications cannot always be made fully stateless. Instead, they can store the running state in a replicated distributed storage. One existing deployment in the literature is found in [5]. By using dedicated plug-ins, a persistent volume is made accessible from inside containers. The state is hence saved onto the distributed file system before replacement or migration, and the new container can finally access the recorded state [2].

In order to maintain an up-to-date version for restoring or migrating running containers, snapshot images of the containers' state have to be created. Commit commands available on container platforms [6] can be used and several optimizations are possible in this respect, e.g., by continuously synchronizing changes only. Furthermore, in this context many core aspects are relevant, including load balancing, replica synchronization, system monitoring, alarm generation, and configuration management. Such aspects are beyond the scope of this work. Instead, we focus on the mechanisms for failure recovery of storage servers.

In fact, robustness of data storage becomes the bottleneck to ensure high availability for the maintenance of the containers' state. Data loss events in data centers are reported as a common event by several operators, e.g., Facebook [7] and Yahoo [8]. The traditional solution is to perform server content replication using three-way random replication, considered the standard good practice in distributed filesystem management [9], [10], [11], [12].

In the literature on distributed storage, nevertheless, there exist techniques to reduce redundancy, e.g., by means of erasure codes or by repairing codes. Erasure codes can achieve great savings in storage space, and are actually used by major cloud providers such as Facebook [13] and Google [9].

The basic idea with erasure codes is that a file is split into $k$ chunks, and then encoded into $n = k + h$ chunks. In case of $r \le h$ server failures, the system state can be recovered by transferring the chunks from $k$ of the $n - r$ remaining servers and decoding those to retrieve the whole original file. Then, the file can be encoded all over again into $n$ chunks and finally the lost encoded chunks are restored on a set of $r$ replacement
servers. We observe that in our context the servers may be either physical servers or virtual storage units, and faults may be simultaneous node failures due, e.g., to cluster-wide power outages [11].

Table I
MAIN NOTATION USED THROUGHOUT THE PAPER

Symbol              Meaning
$B$                 state size
$\beta$             repairing chunk size
$C = (n, k, d)$     repairing code
$c_1$               cost per repair node activation
$c_2$               cost per transferred repair chunk bit
$u(t)$              activation control
$\lambda$           maximum activation rate
$\nu$               number of repair chunks transferred per second
$\mu$               repair server failure rate
$X_d(0)$            number of operational repair servers at time 0
$X_0(t)$            number of newly activated repair servers at time $t$
$X_k(t)$            number of repair nodes having retrieved $k$ repair chunks at time $t$

When there exists a large number of containers, the data transfer phase can become the bottleneck for fast recovery in private clouds and a costly service to offer at scale in a public cloud. A recent solution is the usage of repairing codes [14], [15], [16]. Several trade-offs for such a technique are addressed in [17], showing that a 10-fold improvement is possible over standard erasure coding.

In this work, we investigate feasibility and cost of regeneration operations using repair codes under correlated faults, i.e., when several servers fail at once. State availability requirements are represented by a deadline $T$ to regenerate all servers. The cost that it takes to maintain seamless operation of containers involves both state storage, i.e., activating enough replacement servers, and communication costs, i.e., the cost of transferring coded data chunks to regenerate lost servers. In the rest of the paper, the limit performance of the system is derived using an optimal control framework.

The paper is organized as follows. In Sec. II we review the related literature, whereas in Sec. III we introduce the system model. In Sec. IV we formulate the problem of state storage regeneration in the framework of optimal control. Sec. V details the solution. Sec. VI provides numerical results and Sec. VII concludes the paper. The complete proofs of the statements derived in this paper can be found in the Appendix.

II. RELATED WORKS

Designing robust storage in the cloud is a classical problem. Random replication schemes appeared in the early Google filesystem [9] and in Facebook data centers [10]. Basic erasure codes achieve higher reliability compared to replication with the same storage [18]. The cost reduction in datacenter footprint operations is dramatic, exceeding 50%, thus recommending their usage in next generation systems [19]. Hence, new specialized erasure codes appeared, such as the local reconstruction codes in Windows Azure Storage [20], or piggybacked Reed-Solomon codes to reduce cross-rack restoration bandwidth in Facebook's datacenters [13]. The breakthrough in the field is the class of erasure codes introduced by Papailiopoulos and Dimakis in [15], a class of locally repairable codes of maximum distance separable (MDS) type. Several follow-up works, e.g., [14], [16], have explored the fundamental tradeoff of such codes. They can be either of the minimum storage (MSR) or the minimum bandwidth (MBR) regenerating type. When the code can be maintained in systematic form, simple repair by transfer with no decoding operations is possible. However, in general, regeneration also involves decoding and hence computing time [17], a facet of the problem that we leave as part of future works.

In the rest of the paper, we consider an assigned deadline for failsafe operations, as proposed in [3]: in that work, recovery time limits are imposed on the parallel failover of virtual machines based on customers' SLA plans. Also, in this work we adopt a system perspective close to [17]. To the best of the authors' knowledge, this is the first paper describing optimal control of failsafe operations for storage regeneration.

Figure 1. Storage regeneration: repairing failures via erasure codes (left) and repairing codes (right); $r = 1$, $d = n - 2$

III. SYSTEM MODEL

In order to perform repair coding, the containers' state is divided into $k$ chunks and encoded into $n = k + h$ ones, by using a repairing code $C = (n, k, d)$, where $n > d > k$. Parameter $d$ represents the number of chunks that can be used to repair a lost or corrupted one. The encoded chunks are then distributed to $n$ servers, each storing one chunk. At time $t = 0$, $r$ servers fail, with $0 < r \le n - d$, whereas $n - r$ servers are still operational. In case of an $r$-server fault, there are two main restoration options: either full restoration or regeneration of failed servers. If $r < h$, full state restoration is possible from the chunks of any set of $k$ servers: full restoration requires transferring $k$ data chunks, which have $\alpha$ bytes each, reconstructing the whole state file, performing the encoding process all over again and, finally, transferring the re-encoded chunks to the destination servers (see Fig. 1).

Instead, selective regeneration of failed servers is possible when $r < n - d$: each lost server is replaced by using the chunks of $d$ repair servers, by transferring $\beta$ bits of information from each encoded chunk. Clearly, repairing is possible as long as there exist at least $d$ repair servers. Optimal repairing MSR codes set $\alpha = B/k$ and $\beta = \alpha/(d - k + 1)$, whereas

optimal repairing MBR codes set $\alpha = 2dB/[k(2d - k + 1)]$ and $\beta = 2B/[k(2d - k + 1)]$ [14].

In order to obey availability constraints, we assume that repairing operations need to complete by time horizon $T$, i.e., it must hold $X_d(T) = n$. Once the regeneration procedure through repair codes is completed, the full set of $n$ operational repairing nodes is restored. We model such procedure as follows. First, new repairing servers are activated, e.g., by adding a new physical node to the datacenter, or by installing dedicated storage virtual machines on servers already part of the fabric. They can be switched on at a maximum rate $\lambda$; the activation process is a Poisson process with rate $\lambda$, i.e., new replacement servers can be activated at rate $\lambda > 0$ servers per second.

Once activated, a repairing server downloads parity information from $d$ operational repairing servers. We assume that each chunk transfer requires an exponential random time with mean $1/\nu > 0$. The regeneration procedure has two cost components:
i. activation cost: activating a new repairing server has a cost $c_1$ per repairing server, due to the usage of legacy hardware in the datacenter and the related setup costs;
ii. transfer cost: data transfer has a cost $c_2$ per bit, hence a chunk transfer has a cost $c_2 \beta$.

During the regeneration process, due to hardware and/or software issues, failure of repairing servers may occur as well; failure instants are modeled as exponential random variables of parameter $\mu$.

The number of newly activated servers is denoted by $X_0(t)$, whereas $X_k(t)$ denotes the number of replacement servers that have $k$ repair chunks, for $k = 1, \ldots, d$. Only nodes retrieving $d$ chunks are operational replacement nodes: for notation's sake, we shall let $X_d(t)$ count the whole set of repairing nodes, i.e., including the $n - r$ which have not crashed. Restoration of the system using repair codes is possible if and only if $X_d(t) \ge d$ at each point in time (if $k \le X_d(t) < d$ only full restoration is possible; if $X_d(t) < k$, the containers' state is lost).

A. Markov model and fluid approximation

We shall study how to optimally activate new repairing servers in order to successfully restore all $n$ servers within the finite time horizon $T$ at minimum cost. We start by assuming a stochastic control, namely, the probability $u$ that a replacement server is activated. The activation rate of new repairing servers is $\lambda u(t)$. The control acts by thinning the maximum activation rate $\lambda$, which can be easily implemented by randomly sampling servers to be activated. Thus, $\lambda u(t)$ is the rate at which replacement servers become active subject to the stochastic control $u(t)$. Let us define the state of the system as $X = (X_0, X_1, \ldots, X_d)$, where $X_k$ denotes the number of servers which have retrieved the content from $k$ repairing servers. The state $X(t)$ has a dynamics described by a continuous time Markov decision process (MDP), where we observe that all states $X$ such that $X_d < d$ are absorbing, since no repairing is possible.

Let us assume that once $k$ chunks are acquired, the repair process proceeds by downloading from the remaining $d - k$ repairing servers. Hence, for any initial state $x$, we can write the entries of the transition probability matrix

$$
P_{x,x'}(dt) = P\{X_{t+dt} = x' \mid X_t = x\} =
\begin{cases}
\lambda u(t)\,dt & \text{if } x' = x + e_0 \\
\mu x_0\,dt & \text{if } x' = x - e_0 \\
(d - k + 1)\,\nu\,x_{k-1}\,dt & \text{if } x' = x + e_k - e_{k-1} \\
\mu x_k\,dt & \text{if } x' = x - e_k \\
o(dt) & \text{otherwise}
\end{cases} \tag{1}
$$

where $e_k$ is the $k$-th element of the standard basis. The first row describes the event of activation and the second row the failure of a newly activated repairing server, respectively. The third row describes the acquisition of a repair chunk by a repairing node having $k - 1$ chunks, and the fourth row describes the failure of a node having retrieved $k$ chunks. The last row states that multiple transitions are negligible in the corresponding infinitesimal generator.

The process of regeneration of the servers can be studied using a fluid model. Due to the structure of system (1), the mean-field approximation can be proved tight for $n$ in the order of a few tens [21]. By using the resulting fluid approximation, in the next section we shall obtain an optimal control problem in continuous time.

The control space $U$ is the set of the piecewise continuous functions taking values in $[0, 1]$. The dynamics of the number of repairing servers thus writes

$$
\begin{aligned}
\dot{X}_0(t) &= -\theta_0 X_0(t) + \lambda u(t) = f_0(X, u, t) \\
\dot{X}_1(t) &= -\theta_1 X_1(t) + d\,\nu X_0(t) = f_1(X, u, t) \\
&\ \ \vdots \\
\dot{X}_k(t) &= -\theta_k X_k(t) + (d - k + 1)\,\nu X_{k-1}(t) = f_k(X, u, t) \\
&\ \ \vdots \\
\dot{X}_d(t) &= -\theta_d X_d(t) + \nu X_{d-1}(t) = f_d(X, u, t)
\end{aligned} \tag{2}
$$

The ODE system (2) represents the dynamics of the regeneration process. Here, $\theta_k = \mu + (d - k)\nu$ is the rate at which servers with $k$ chunks fail to repair plus the rate at which they receive a new chunk, thus joining those having $k + 1$ chunks. Also, the first equation of the ODE system (2), namely $f_0(\cdot)$, incorporates the activation of new peers at controlled rate $\lambda u(t)$.

IV. OPTIMAL CONTROL PROBLEM

The objective is to minimize the cost to restore the system by deadline $T$: the storage regeneration dynamics (2) is controlled by the activation control $u$. Hence, the objective function writes

$$
J(u) = \int_0^T \Big[ c_1 \lambda u(v) + c_2 \beta \nu \sum_{i=0}^{d-1} (d - i)\, X_i(v) \Big]\, dv \tag{3}
$$

where the first term appearing in the integral is the server activation cost whereas the second one is the cost for transferring chunks to repair servers. We shall solve the following optimization problem:

Problem 1 (Optimal Storage Regeneration). Find a control policy $u$ which solves:

$$
\begin{aligned}
\min_{u \in U}\ & J(u) \\
\text{s.t.}\ & X_d(t) \ge d, \quad 0 \le t \le T \\
& X_d(T) = n
\end{aligned} \tag{4}
$$

where $d \le X_d(0) \le n$.

In order for the repairing procedure to succeed, at least $d$ repair nodes must be present at all points in time. We observe that, because (2) describes the deterministic dynamics of the mean value of the underlying MDP, it is possible that some sample paths do not satisfy the constraints, an event that should occur with small probability. To this aim, it is possible to tighten the constraints appearing in (4), in the form

$$d' = (1 + \delta_1)\, d, \qquad n' = (1 + \delta_2)\, n,$$

where $\delta_1, \delta_2 > 0$ represent relative margins. In the rest of the paper, we shall refer to the case $\delta_1 = \delta_2 = 0$ without loss of generality.

Hereafter, we shall determine the conditions under which the problem is feasible, i.e., the set of solutions of the problem is not empty. Actually, we recall that, as long as $k$ chunks exist in the system, full restoration is still possible. However, we focus solely on the cases when regeneration is feasible, which can be determined easily by analysis of the uncontrolled dynamics, as discussed next.

A. Feasibility and System Dimensioning

Let us denote by $\bar{X}_d(t)$ the dynamics corresponding to $u(t) \equiv 1$ in the interval $[0, T]$. Because the activation control is basically slowing down the maximum activation rate $\lambda$, it holds $X_d(t) \le \bar{X}_d(t)$ for all $t \in [0, T]$. Hence, it is immediate to observe that the problem is feasible if and only if the dynamics of $\bar{X}_d$ is compatible with the constraints. Such a condition can be derived in closed form. By writing the Laplace transform of (2), i.e., $\bar{X}_k(s) = \mathcal{L}\{\bar{X}_k(t)\}$, we obtain

$$
\bar{X}_0(s) = \frac{\lambda}{s + \theta_0}, \quad
\bar{X}_1(s) = \frac{d\,\nu\,\bar{X}_0(s)}{s + \theta_1}, \quad \ldots, \quad
\bar{X}_d(s) = \frac{\nu\,\bar{X}_{d-1}(s) + \bar{X}_d(0)}{s + \theta_d}
$$

which in turn provides $\bar{X}_d(s) = \dfrac{\lambda\, d!\, \nu^d}{\prod_{k=0}^{d} (s + \theta_k)} + \dfrac{\bar{X}_d(0)}{s + \mu}$. As shown in the Appendix, the following closed form expression for the dynamics of the repairing servers holds:

$$
\bar{X}_d(t) = e^{-\mu t} \Big[ \lambda \big(1 - e^{-\nu t}\big)^d + \bar{X}_d(0) \Big]
$$

Feasibility conditions can be described in terms of the system parameters as follows:

Lemma 1. Problem 1 is feasible if and only if $\lambda \big(1 - e^{-\nu T}\big)^d \ge n\, e^{\mu T} - \bar{X}_d(0)$ and it is so for any $\mu \le \mu^*$, where $\mu^* := \min\{\mu_n, \mu_d\}$ and $\mu_n = \max\{\mu \ge 0 \mid \bar{X}_d(T) \ge n\}$ and $\mu_d = \max\{\mu \ge 0 \mid \min_{t \in [0,T]} \bar{X}_d(t) \ge d\}$.

In the rest of the paper we assume $\mu > 0$ and feasibility in the sense meant by the previous statement.

System dimensioning. Lemma 1 provides indications for dimensioning the system in order to guarantee feasible regeneration. In particular, in the worst case we would need to transfer $n - d$ chunks to newly activated repair nodes. In turn, one would choose the time horizon by which to repair, namely $T$, and $\nu$, i.e., the rate at which chunks can be transferred, and the code's triple $C = (n, k, d)$, in such a way as to satisfy the assumptions of the above statement.

B. Relaxed problem

Constraint Relaxation. The terminal state constraint can be accounted for by relaxing the problem in the form

$$
J_\gamma(u) = J(u) + \gamma\,\big(n - X_d(T)\big) \tag{5}
$$

by means of the terminal cost function $q(X) := \gamma\,(n - X_d(T))$. We note that $\gamma \ge 0$ has the role of a multiplier, and when the constraint is active $\gamma > 0$.

State Augmentation. In order to account for the first constraint, we operate the augmentation of the state space by introducing an auxiliary variable

$$
\dot{X}_{d+1}(t) = \big(X_d(t) - d\big)^2\, \mathbb{1}\{d - X_d(t)\}
$$

where the indicator function $\mathbb{1}\{x\} = 1$ if $x > 0$ and $\mathbb{1}\{x\} = 0$ if $x < 0$. Since

$$
X_{d+1}(t) = \int_0^t \dot{X}_{d+1}(v)\, dv + X_{d+1}(0),
$$

we impose the auxiliary constraint $X_{d+1}(T) = X_{d+1}(0) = 0$: because $\dot{X}_{d+1}(t) \ge 0$ for $t \in [0, T]$, when such two constraints are satisfied, then $X_d(t) \ge d$ all over the interval $[0, T]$.

We call the problem of minimizing $J_\gamma(u)$ the relaxed problem, and it will be solved next.

C. Hamiltonian formulation and Pontryagin Principle

Let us denote by $g(X, u, t)$ the instantaneous cost appearing inside the integral cost (3). In order to solve the optimal control problem, it is possible to write the Hamiltonian for the optimal control problem in standard form

$$
H(X, u, p) = p(t) \cdot f(X, u) + g(X, u)
$$

where $p$ is the vector of co-state variables. Hence, according to the Pontryagin Minimum Principle [22], [23], the optimal control $u^*$ needs to satisfy

$$
u^*(t) = \arg\min_{u \in U} H(X, u, p)
$$

where the associated Hamiltonian system is

$$
\dot{X}_k = H_{p_k}(X, u, p) \tag{6}
$$

$$
\dot{p}_k = -H_{X_k}(X, u, p) \tag{7}
$$

We have $d + 1$ terminal conditions in the form $p_k(T) = q_{X_k}(T) = 0$ for $k = 0, 1, \ldots, d - 1, d + 1$. Also, the terminal condition $p_d(T) = q_{X_d}(T) = -\gamma$ holds.
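As a rough numerical illustration of the fluid dynamics (2) and of the running cost (3), the sketch below integrates the ODE system with a forward-Euler scheme under a threshold activation control. It is not the authors' code: the function name `simulate_threshold`, the argument names (`lam` for the maximum activation rate, `nu` for the chunk transfer rate, `mu` for the failure rate), the Euler step, and the per-chunk cost `c2_beta` (cost per bit times chunk size, charged at the transfer rate) are assumptions made for illustration only.

```python
def simulate_threshold(T, lam, nu, mu, d, xd0, c1, c2_beta, t_on, t_off, steps=7000):
    """Forward-Euler integration of the fluid ODE (2) under a threshold control.

    Returns (X_d(T), J), where J approximates the integral cost (3).
    All argument names are illustrative assumptions, not the paper's notation.
    """
    dt = T / steps
    # x[k] approximates X_k(t) for k = 0..d; only X_d starts non-empty.
    x = [0.0] * d + [float(xd0)]
    # theta[k]: failure rate plus the rate of the (d - k) parallel downloads.
    theta = [mu + (d - k) * nu for k in range(d + 1)]
    cost = 0.0
    for s in range(steps):
        t = s * dt
        u = 1.0 if t_on <= t <= t_off else 0.0
        # Running cost: activations plus chunk transfers in progress.
        transfers = sum((d - i) * x[i] for i in range(d))
        cost += (c1 * lam * u + c2_beta * nu * transfers) * dt
        # Right-hand side of (2), evaluated on the current state.
        dx = [-theta[0] * x[0] + lam * u]
        dx += [-theta[k] * x[k] + (d - k + 1) * nu * x[k - 1]
               for k in range(1, d + 1)]
        x = [xk + dxk * dt for xk, dxk in zip(x, dx)]
    return x[d], cost


if __name__ == "__main__":
    xd_T, J = simulate_threshold(T=3.5, lam=10.0, nu=1.9375, mu=0.001, d=20,
                                 xd0=39.0, c1=10.0, c2_beta=0.0,
                                 t_on=0.0, t_off=1.22)
    print(f"X_d(T) = {xd_T:.3f}, J = {J:.2f}")
```

The example call uses the setting of the numerical section (horizon 3.5 s, 10 servers/s, failure rate 0.001 per second, d = 20, 39 surviving servers). The transfer rate of about 1.94 chunks/s is our own estimate, obtained by dividing the 1 Gbit/s throughput by the 64.5161 Mbyte chunk size. Under these assumptions, activating on [0, 1.22] s should bring the number of operational servers close to n = 50.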

V. SOLUTION

In order to solve the storage regeneration problem, we can write the Hamiltonian as

$$
\begin{aligned}
H(X, u, p) ={}& \big(c_1 + p_0(t)\big)\, \lambda u(t) - \theta_0 X_0(t)\, p_0(t)
 + c_2 \beta \nu \sum_{i=0}^{d-1} (d - i)\, X_i(t) \\
& + \sum_{k=1}^{d} \Big[ -\theta_k X_k(t) + (d - k + 1)\,\nu X_{k-1}(t) \Big]\, p_k(t) \\
& + p_{d+1}(t)\, \big(X_d(t) - d\big)^2\, \mathbb{1}\{d - X_d(t)\}
\end{aligned} \tag{8}
$$

where $\lambda$ is the maximum activation rate, $\nu$ the chunk transfer rate, $\mu$ the failure rate, $\beta$ the repair chunk size and $\theta_k = \mu + (d - k)\nu$ as in (2). We can hence derive from (7) the adjoint ODE system in the costate variables

$$
\begin{aligned}
\dot{p}_0 &= -H_{X_0} = \theta_0\, p_0 - d\,\nu\, p_1 - c_2 \beta \nu\, d \\
\dot{p}_1 &= -H_{X_1} = \theta_1\, p_1 - (d - 1)\,\nu\, p_2 - c_2 \beta \nu\, (d - 1) \\
&\ \ \vdots \\
\dot{p}_k &= -H_{X_k} = \theta_k\, p_k - (d - k)\,\nu\, p_{k+1} - c_2 \beta \nu\, (d - k) \\
&\ \ \vdots \\
\dot{p}_{d-1} &= -H_{X_{d-1}} = \theta_{d-1}\, p_{d-1} - \nu\, p_d - c_2 \beta \nu \\
\dot{p}_d &= -H_{X_d} = \theta_d\, p_d - 2\big(X_d(t) - d\big)\, \mathbb{1}\{d - X_d(t)\}\, p_{d+1} \\
\dot{p}_{d+1} &= 0
\end{aligned} \tag{9}
$$

In what follows, we will derive the structure of the solutions of the optimal control problem. A bang-bang policy [22], [23] is one where $u(t)$ takes only extreme values, that is $u(t) = 1$ or $u(t) = 0$ a.e. in $[0, T]$.

Notice that bang-bang policies are very convenient for implementation purposes since they rely only on a set of switching epochs, where the control switches from 1 to 0 or vice versa. A threshold policy is one in the form

$$
u(t) =
\begin{cases}
0 & 0 \le t < t_{on} \\
1 & t_{on} \le t \le t_{off} \\
0 & t_{off} < t \le T
\end{cases} \tag{10}
$$

Threshold policies are convenient since they depend on a pair of parameters only, namely the thresholds $t_{on}$ and $t_{off}$.

Bang-bang structure. We observe that (8) is linear in the control $u$. Hence, because the optimal activation control minimizes the Hamiltonian, the optimal policy has to satisfy

$$
u^*(t) =
\begin{cases}
1 & \text{if } p_0(t) < -c_1 \\
0 & \text{if } p_0(t) > -c_1
\end{cases} \tag{11}
$$

which depends on the dynamics of $p_0$, i.e., of the ODE system (9). Actually, in order to prove that the policy is bang-bang and non-degenerate, we also need to prove that the policy has a finite number of switches and that there are no singular arcs, i.e., no arcs where the Hamiltonian is null over an interval of positive measure.

Lemma 2. If the problem is feasible, the optimal policy is bang-bang with no singular arcs.

The dynamics of $p_0$ can be derived in closed form:

Lemma 3. It holds $p_0(t) = F(t) + G(t)$ where

$$
F(t) = -\gamma\, \big(1 - e^{-\nu (T - t)}\big)^d\, e^{-\mu (T - t)}
$$

$$
G(t) = c_2 \beta \nu\, d \sum_{k=0}^{d-1} \binom{d-1}{k} \int_0^{T - t} \big(e^{\nu v} - 1\big)^k\, e^{-(\mu + d \nu) v}\, dv
$$

Next, we characterize solutions of the relaxed problem which correspond to feasible solutions.

A. Pure Activation Cost

We start our analysis from the simpler case when the transfer cost is negligible compared to the activation cost, i.e., $c_2 = 0$. It is hence possible to derive explicit relations on the structure of the optimal control.

Theorem 1. If $c_2 = 0$, then a solution of the relaxed problem is a threshold policy, in particular:
i. Single switch: $t_{on} = 0$ and $0 < t_{off} < T$ iff $\mu \le \mu_0$;
ii. Null control: $0 = t_{on} = t_{off}$ iff $\mu > \mu_0$, and $m \ge -c_1$, where $m = \min_{v \in [0,T]} \{p_0(v)\}$;
iii. Double switch: $0 < t_{on} < t_{off} \le T$ iff $\mu > \mu_0$, and $m < -c_1$.
The critical value is

$$
\mu_0 := \max\Big\{ 0,\ \frac{d}{T} \log\Big( \sqrt[d]{\gamma / c_1}\, \big(1 - e^{-\nu T}\big) \Big) \Big\}
$$

while the switching epochs write $t_{on} = \max\{0,\ T + \frac{1}{\nu} \log z_{on}\}$, $t_{off} = T + \frac{1}{\nu} \log z_{off}$, where $z_{on} \le z_{off}$ are the two solutions for $0 \le z \le 1$ of the equation

$$
(1 - z) = \sqrt[d]{\frac{c_1}{\gamma}}\; z^{-\frac{\mu}{\nu d}}
$$

B. General case

In the general case, it is sufficient to characterize the dynamics of the multiplier $p_0(t)$ in terms of the extremal points attained in the interior of $[0, T]$.

Lemma 4. Let $S(\gamma)$ be the set of the interior extremal points of $p_0(t)$ for a given choice of the constraint multiplier $\gamma$. Then, $S(\gamma)$ is one of the following forms: $\emptyset$, $\{M\}$, or $\{m, M\}$, where $m := p_0(t_m)$ denotes a minimum and $M := p_0(t_M)$ a maximum, and it holds $0 \le t_m < t_M < T$.

Finally, as proved in the Appendix, the following result holds.

Theorem 2. The optimal solution of the relaxed problem is a threshold control.

The optimal control is hence a threshold policy for which the presence of an initial delay, i.e., $t_{on} > 0$, depends on the parameters of the system. However, as a straightforward application of the optimality principle, given an optimal threshold policy with $t_{on}$ and $t_{off}$, for a given pair $T$ and $r$, the new threshold policy where $t'_{on} = 0$, $t'_{off} = t_{off} - t_{on}$ is optimal for the problem where $r' = n - X_d(t_m) \ge r$ and horizon $T' = T - t_{on} < T$. Thus we obtain the optimal solution in threshold form with no initial delay for more conservative conditions, i.e., for a smaller time horizon and a larger number of failed servers, and yet having the same cost.
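To make the threshold structure concrete, the following sketch evaluates the closed-form costate of Lemma 3 on a time grid and applies the switching rule (11) to read off the switching epochs. It is a minimal illustration under stated assumptions rather than the authors' implementation: the function and argument names are hypothetical (`nu` is the chunk transfer rate, `mu` the failure rate, `c2_beta` the cost of one chunk transfer), the grid scan assumes the active set is a single interval as guaranteed by Theorem 2, and the term G is evaluated through the binomial identity that collapses the sum inside the integral to exp((d-1) nu v).

```python
import math


def costate_p0(t, T, gamma, nu, mu, d, c2_beta=0.0):
    """Closed-form p0(t) = F(t) + G(t) following the structure of Lemma 3."""
    tau = T - t
    # F: multiplier times the probability that a server activated at time t
    # survives and collects all d chunks by the deadline T.
    F = -gamma * (1.0 - math.exp(-nu * tau)) ** d * math.exp(-mu * tau)
    # G: sum_k C(d-1,k) (e^{nu v} - 1)^k = e^{(d-1) nu v}, so the integrand
    # reduces to exp(-(mu + nu) v) and the integral is elementary.
    G = c2_beta * nu * d * (1.0 - math.exp(-(mu + nu) * tau)) / (mu + nu)
    return F + G


def switching_epochs(T, gamma, nu, mu, d, c1, c2_beta=0.0, steps=35000):
    """Scan [0, T] and return (t_on, t_off) where p0(t) < -c1, per rule (11).

    Returns None when the control is null on the whole grid.
    """
    dt = T / steps
    active = [i * dt for i in range(steps + 1)
              if costate_p0(i * dt, T, gamma, nu, mu, d, c2_beta) < -c1]
    return (active[0], active[-1]) if active else None


if __name__ == "__main__":
    # Setting of the numerical section with c2 = 0; nu is our own estimate
    # (1 Gbit/s divided by the 64.5161 Mbyte chunk size).
    print(switching_epochs(T=3.5, gamma=12.7719, nu=1.9375, mu=0.001,
                           d=20, c1=10.0))
```

With the multiplier value reported in the numerical section for zero communication cost, the scan should return a single switch with no initial delay and a switch-off epoch close to 1.22 s. Scanning a grid is a deliberately simple choice; a root finder applied to p0(t) + c1 would locate the epochs more precisely.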

Algorithm 1: Optimal Regeneration Control
1: input: $T$, $\lambda$, $\nu$, $\mu$, $c_1$, $c_2$, $\epsilon$
2: $\gamma_0$ s.t. $u$ from (11) is such that $X_d(T) \ge n$
3: initialize: $\gamma_R \leftarrow \gamma_0$, $\gamma_L \leftarrow 0$, $i \leftarrow 0$
4: while $|X_d(T) - n| > \epsilon$ do
5:   Step $i \leftarrow i + 1$
6:   $\gamma_i \leftarrow (\gamma_L + \gamma_R)/2$
7:   Obtain $p_0(t)$, $t \in [0, T]$ solving backwards (9)
8:   Calculate the optimal control $u_i$ according to (11)
9:   Obtain $X_d(t)$, $t \in [0, T]$ solving forward (2)
10:  if $X_d(T) > n$ then
11:    $\gamma_R \leftarrow \gamma_i$
12:  else
13:    $\gamma_L \leftarrow \gamma_i$
14:  end if
15: end while
16: return $(u_i, \gamma_i)$

Note that, in the relaxed problem, we cannot exclude the null control $u \equiv 0$, i.e., when $p_0(0) > -c_1$ and $m > -c_1$. However, it cannot solve the constrained problem: to do so we need to determine the optimal multiplier $\gamma^*$, as seen next.

C. Optimal multiplier

The discussion so far has addressed the relaxed problem, and the multiplier $\gamma$ has been treated as a constant for the sake of discussion. However, determining the optimal solution requires identifying a pair $(u^*, \gamma^*)$ where $u^*$ solves the original constrained problem.

The main result in this section is that we can calculate the value $\gamma^*$ using a simple bisection search as described in Algorithm 1, under the feasibility assumptions of Lemma 1. The algorithm starts by exploring the interval $\gamma \in [0, \gamma_0]$, where $\gamma_0 > 0$ is a suitably large value such that it holds $X_d(T) \ge n$. At lines 7 to 9 it solves the optimal control problem, finally determining the terminal value $X_d(T)$ within a certain tolerance $\epsilon > 0$.

The search algorithm leverages the fact that the terminal number of repair servers is monotone in $\gamma$. In fact, when the terminal value exceeds $n$, it explores the left of the current interval, i.e., it searches in $[\gamma_L, \gamma_i]$. Vice versa, when the terminal value is below $n$, it explores the right interval $[\gamma_i, \gamma_R]$.

The formal justification of the correctness of the above search strategy, and the optimality of the output of the algorithm, is summarized by the following result, proved in the Appendix.

Theorem 3. Under the assumptions of Lemma 1, the optimal pair $(u^*, \gamma^*)$ which solves the relaxed problem is unique, $u^*$ solves Prob. 1, and $\gamma^*$ can be approximated using a bisection search as in Alg. 1.

VI. NUMERICAL RESULTS

This section presents some numerical results on optimal storage regeneration under a realistic parameter setting. It also serves the purpose of explaining how to make use of the proposed model to characterize the limit performance of the regeneration technique under prescribed deadline constraints. We have assumed a reference $C = (n, k, d)$ MBR repairing code. The parameters of the code are $n = 50$, $k = 10$ and $d = 20$ [17] (footnote 1). Also, the reference container state size is assumed to be $B = 10$ Gbytes. We recall that, based on the fundamental relation on MBR codes, we can derive the chunk size as $\beta = 2B/(k(2d - k + 1))$ [14], which in this case amounts to $\beta = 64.5161$ Mbytes.

Footnote 1: In [17] the code redundancy targets storage availability of 0.99.

The numerical setting is completed by assuming that repairing servers may fail according to rate $\mu = 0.001\ \mathrm{s}^{-1}$ (we recall that in our model server failures during restoration are exponential random variables of parameter $\mu$). Furthermore, the maximum rate at which repairing servers can be activated is set as $\lambda = 10$ servers/s. Also, we need to make assumptions on the available network throughput: in our scenario, the throughput available for repairing operations is 1 Gbit/s. This value matches link speeds of production datacenters: peak bitrates for repair chunk transfer can be attained when performing restoration in priority, i.e., giving highest priority to the traffic carrying the repairing chunks. The resulting target horizon for repairing has been set to $T = 3.5$ s, which is feasible given the setting considered.

Fig. 2a and Fig. 2b depict the results of the optimal activation control in case of simultaneous failure of $r = 11$ servers at time $t = 0$. We have reported the dynamics of the costate variable $p_0(t)$, superimposed on the switching threshold value, namely $-c_1$ (upper graph), the graph of the corresponding optimal control dynamics (middle graph) and the one corresponding to the dynamics of the number of repairing servers $X_d(t)$ (bottom graph).

In both cases, the optimal multiplier has been determined using Algorithm 1 with tolerance $\epsilon = 0.05$. In particular, in Fig. 2a we have considered the case of a null communication cost $c_2 = 0$, which corresponds to $\gamma^* = 12.7719$, whereas in Fig. 2b we have considered $c_2 = 100$ dollars/Gbyte, for which the optimal cost is attained for $\gamma^* = 175.855$. In both cases the threshold policy is such that the pair $t_{on} = 0$ s and $t_{off} = 1.22$ s identifies the unique control driving the dynamics to satisfy the terminal state constraint $X_d(T) = n$.

Fig. 2c contains two tables calculated for different values of the costs $c_1$ and $c_2$. They report the value of the optimal cost $J(u^*)$ and of the optimal multiplier $\gamma^*$. We note that, as expected, the optimal cost increases with both costs $c_1$ and $c_2$. Also, we observe the same behavior for $\gamma^*$: the optimal multiplier value increases and we ascribe this behavior to the fact that the value of $\gamma^*$ has to enforce the terminal state constraint against augmented running costs $c_1$ and $c_2$.

VII. CONCLUSIONS

In this paper we have presented an analytical framework for the optimal control of state regeneration, a promising technology in order to offer high availability of containerized applications at scale and ease the migration of stateful containers. The idea is that, leveraging the network filesystem, it is possible

6
a) b)
0 Numerical 100 Numerical
c2
p (t)

p (t)
5 Theory 50 Theory J (u )
0

0
10 0 0 10 100
50
0 1 2 3 0 1 2 3 1 12.2 169.0 1580.6
1 1 c1 10 122.5 279.1 1691.9
u(t)

u(t)
0.5 0.5 20 244.9 401.3 1812.9
0 0 c2
0 1 2 3 0 1 2 3 0 10 100
1 1.2766 17.5851 164.0627
X (t)

X (t)
50 50
d

d
40 40 c1 10 12.8000 29.1024 175.8790
30 30 20 25.5990 41.7977 188.2813
0 1 2 3 0 1 2 3
t t
Figure 2. Optimal regeneration control a) zero communication cost b) c2 = 100 dollar/Gbyte c) The optimal cost and the optimal multiplier as function of
costs c1 and c2 .

to decouple the storage of the containers' state and the execution of application images running in pods.

We have studied optimal time-constrained regeneration, a crucial aspect to ensure high availability of access to the containers' state. Under failure of a number of servers, regeneration is performed by transferring repair chunks to newly deployed, clean-slate repair servers. This occurs at a communication cost and at a server activation cost. The optimal activation strategy is of threshold type and can be evaluated in closed form.

This work has been motivated by the limited number of studies on storage regeneration at system level [17] and it is by no means conclusive. Indeed, several research directions remain open in order to understand the potential of these novel restoration techniques in cloud systems.

The first one relates to the frequency of updates of the containers' state, a design choice required in order to decide how often to dump the containers' state onto the network filesystem. This rate determines how much of the computation already elapsed can be recovered using regeneration.

Another relevant issue is the case of repeated failures. Information on where faults are more likely becomes available to the administrator over time, e.g., based on direct observation or online learning techniques. The optimal policy may in turn span several cycles of faults and restorations, and would account for techniques to learn the a posteriori distribution of faults over which to operate the optimal control.

Also, the correlated faults described in this work are simultaneous. In reality, they may be scattered in time, e.g., due to cascading failures. Under such fault dynamics, the optimal control studied in this work may be suboptimal. New models should identify how to counter the effect of additional faults occurring later during regeneration.

REFERENCES

[1] B. Burns, B. Grant, D. Oppenheimer, E. Brewer, and J. Wilkes, "Borg, Omega, and Kubernetes," Comm. of the ACM, vol. 59, no. 5, pp. 1837-1852, May 2016.
[2] W. Li and A. Kanso, "Comparing containers versus virtual machines for achieving high availability," in Proc. of IEEE IC2E, Tempe, US, March 9-12 2015.
[3] V. Salapura, R. Harper, and M. Viswanathan, "ResilientVM: High performance virtual machine recovery in the cloud," in Proc. of ACM AIMC, Bordeaux, France, Apr 21-24 2015, pp. 7-12.
[4] J. Gray and D. P. Siewiorek, "High-availability computer systems," Computer, vol. 24, no. 9, pp. 39-48, 1991.
[5] Infinit International Inc, https://infinit.sh/documentation/reference.
[6] Docker, "Docker: The linux container engine," http://www.docker.io.
[7] D. Borthakur et al., "Apache Hadoop goes realtime at Facebook," in Proc. of ACM SIGMOD PODS, Athens, Greece, June 12-16 2011.
[8] R. J. Chansler, "Data availability and durability with the hadoop distributed file system," The USENIX Magazine, vol. 37, no. 1, February 2012.
[9] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," SIGOPS Oper. Syst. Rev., vol. 37, no. 5, pp. 29-43, Oct. 2003.
[10] A. Lakshman and P. Malik, "Cassandra: A decentralized structured storage system," SIGOPS Oper. Syst. Rev., vol. 44, no. 2, pp. 35-40, Apr. 2010.
[11] A. Cidon, S. Rumble, R. Stutsman, S. Katti, J. Ousterhout, and M. Rosenblum, "Copysets: Reducing the frequency of data loss in cloud storage," in Proc. of USENIX ATC, San Jose, US, June 26-28 2013.
[12] A. Cidon, R. Escriva, S. Katti, M. Rosenblum, and E. G. Sirer, "Tiered replication: A cost-effective alternative to full cluster geo-replication," in Proc. of USENIX ATC, Santa Clara, CA, July 8-10 2015.
[13] K. V. Rashmi, N. B. Shah, D. Gu et al., "A solution to the network challenges of data recovery in erasure-coded distributed storage systems: A study on the Facebook warehouse cluster," in Proc. of USENIX HotStorage, San Jose, CA, June 27-28 2013.
[14] N. B. Shah, K. V. Rashmi, P. V. Kumar, and K. Ramchandran, "Distributed storage codes with repair-by-transfer and nonachievability of interior points on the storage-bandwidth tradeoff," IEEE Trans. Information Theory, vol. 58, no. 3, pp. 1837-1852, 2012.
[15] D. S. Papailiopoulos and A. G. Dimakis, "Locally repairable codes," IEEE Trans. Information Theory, vol. 60, no. 10, pp. 5843-5855, Oct 2014.
[16] M. Sathiamoorthy, M. Asteris, D. Papailiopoulos, A. G. Dimakis et al., "Xoring elephants: novel erasure codes for big data," in Proc. of PVLDB, Riva del Garda, Italy, August 26-30 2013.
[17] S. Jiekak, A.-M. Kermarrec, N. Le Scouarnec, G. Straub, and A. Van Kempen, "Regenerating codes: A system perspective," SIGOPS Oper. Syst. Rev., vol. 47, no. 2, pp. 23-32, Jul. 2013.
[18] H. Weatherspoon and J. Kubiatowicz, "Erasure coding vs. replication: A quantitative comparison," in Proc. of IPTPS, Cambridge, MA, USA, March 7-8 2002.
[19] O. Khan, R. C. Burns, J. S. Plank, W. Pierce, and C. Huang, "Rethinking erasure codes for cloud file systems: minimizing I/O for recovery and degraded reads," in Proc. of USENIX FAST, San Jose, US, February 14-17 2012.
[20] C. Huang, H. Simitci, Y. Xu et al., "Erasure coding in Windows Azure storage," in Proc. of USENIX ATC, Boston, MA, June 26-28 2012.
[21] E. Altman, L. Sassatelli, and F. De Pellegrini, "Dynamic control of coding for progressive packet arrivals in DTNs," IEEE Trans. on Wireless Comm., vol. 12, no. 2, pp. 725-735, 2013.
[22] G. Leitmann, An introduction to optimal control. McGraw-Hill, 1966.
[23] D. E. Kirk, Optimal Control Theory. An Introduction., 13th ed. Prentice Hall, 2004.


A PPENDIX Qd (s) = s+ , it follows
P ROOF OF L EMMA . 1 d1  
d d! c2 d X d 1 i+1 i!
Q0 (s) = Qd + Q
Proof: Feasibility is indeed equivalent to X d to respect k s i=0 i k h

T d k=0 (s + ) h=0 (s + )
the constraints. Condition 1 e d X d (0) ensures
In order to obtain Qthe closed form of p0 (t), auxiliary expres-
that sup X d (T ) n, which is attained for = 0. We k
sions of the kind h=0 (s + )h have to be inverted. Let us
observe that d is also well defined: g() = inf t[0,T ] X d (t)
is a continuous function of . Because inf g() = 0 and
denote fh (t) := e h t
1 {t 0}, for the sake of notation. By
recalling L{et 1 {t 0}} = 1/(s + ), it is possible to
g(0) = Xd (0) d, there exists a value of that satisfies
calculate
the definition. The statement follows immediately from the ( k )
ei t 1 {t 0}
Y Xn
definition of d and n and from a continuity argument. 1 h Qn
L (s + ) = f1 . . . fn =
j=0 j i
h=0 i=0
P ROOF OF L EMMA . 2 j6=i

ei t 1 {t 0} X (1)ni
n
X n
Proof: Preliminarily, let observe that a feasible solution = Qn = fi (t) (13)
must be such that Xd (t) d, for t [0, T ]. Thus, the dual i=0
n j=0 i j i=0
n i!(n i)!
j6=i
ODE system has to be solved as in the non-augmented case,
The statement follows after some algebraic manipulations of
where it holds pd = d pd . Hence, since the Hamiltonian is
the above expression.
linear in the control, a feasible policy is a bang-bang one.
In order to exclude the presence of singular arcs, we need to
exclude the possibility that c1 +p0 (t) = 0 over an interval I of P ROOF OF T HM . 1
positive measure. We shall prove that multiplier p0 s cannot be Proof: From Lemma 3, if c2 = 0, it follows
a constant over any interval I of positive measure (I) > 0,  
and this guarantees that the control is actually bang-bang [22]. p0 (t) = e(T t) (1e(T t) )d1 +(+d)e(T t)
Let assume that p0 is a constant p0 = c1 over interval I: from which it is immediate to observe that the absolute mini-
hence all its k-th order derivatives vanish in I. But, it follows 
mum over the real line is attained at tmin = T 1 log 1+ d ;
from (9) that p1 = d0 p0 : thus p1 is also a constant over I, and d d+
1 the minimum writes m := ( d ) /(1 + d ) .
since p2 = (d1) p1 , p2 as well. We hence iteratively obtain
Switching epochs ts are determined by the instants solving
that pi is a constant for i = 1, , 2, . . . , d. However, pd = d pd ,
p0 (ts ) = c1 . First, let observe that p0 (T ) = 0 > c1 and
so that 0 = pd = pd1 = . . . = p1 . Finally, p0 = 0, which is
p0 (T ) = d, so that the control is indeed null in a left interval
a contradiction.
of T . In particular, it is possible to identify three cases: for
P ROOF OF L EMMA . 3 a given value of > 0, there might exist either two, one or
zero switching epochs in the interior of [0, T ]. We consider
Proof: The adjoint ODE system can be solved via Laplace the three cases separately.
transform. We make the replacement qk (v) = pk (T t), thus
Case i: single switch. The condition for a unique switching
considering the backward time variable v = T t. It holds
epoch is p0 (0) < c1 , which writes (1 eT )d eT > c1 ,
qk (v) = pk (t), so that system (9) writes
so that
q0 = 0 q0 + d q1 + c2 d d p 
.. > log d /c1 (1 eT ) := 0
T
.
By inspection of (14), due to the continuity of p0 , there exists
qk = k qk1 + (d k) qk+1 + c2 (d k), switching epoch 0 < toff < T such that p0 (toff ) = c1 . Because
k = 1, . . . , d 1 p0 (t) has unimodal structure, such switch is unique so that the
.. corresponding optimal control is in threshold form. Namely,
. u(t) = 1 for 0 t < toff and zero otherwise.
qd = d qd + 2(Xd (t) d) 1 {d Xd (t)} qd+1 Case iii: two switches. Condition p0 (0) > c1 leads to a
qd+1 = 0 non-null control if and only if m < c1 . From the unimodal
structure of p0 , and from classic continuity arguments, there
Let Qk (s) = L{qk (v)} for t = 0, 1, . . . , d be the Laplace
exist two real values, namely ton < tmin < toff where p0 (ton ) =
transform of the k-th variable qk . The corresponding system
c1 = p0 (toff ), so that u(t) = 1 for ton < t < toff and zero
writes
otherwise.
1
sQk (s) = k Qk (s) + (d k)Qk+1 (s) + c2 (d k) , Case ii: no switch. This is the case ton = toff = 0, i.e., the
s
for k = 0, 1, . . . , d 1 optimal control is the null one. It occurs when p0 (0) > c1
and m c1 .
sQd (s) = d Qd (s) (12)
Finally, the explicit expression of the switching epochs is
(dk) c2 (dk)
from which Qk (s) = (s+k ) Qk+1 (s) + s(s+k )
is obtained. obtained by solving equation p0 (t) = c1 , which concludes
By iterative replacement, and by accounting for the fact that the proof.
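As a numerical companion to the case analysis in the proof of Thm. 1, the following minimal sketch locates the switching epochs by solving p0(t) = -c1 on [0, T] and returns the resulting threshold control. The costate `toy_costate` is a hypothetical unimodal stand-in with p0(T) = 0, not the closed form obtained from Lemma 3, and all function names are illustrative assumptions.

```python
import math
from typing import Callable, Tuple


def toy_costate(t: float, T: float = 3.0, k: float = 5.0) -> float:
    """Hypothetical unimodal costate: p0(T) = 0, single minimum at t = T - 1."""
    v = T - t  # backward time
    return -k * v * math.exp(-v)


def bisect_root(g: Callable[[float], float], lo: float, hi: float,
                tol: float = 1e-10) -> float:
    """Refine a sign change of g inside [lo, hi] by interval halving."""
    neg_lo = g(lo) < 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if (g(mid) < 0) == neg_lo:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def threshold_control(p0: Callable[[float], float], c1: float, T: float,
                      grid: int = 2000) -> Tuple[float, float]:
    """Return (t_on, t_off): u(t) = 1 for t_on <= t < t_off, zero otherwise.

    The control is active where p0(t) + c1 < 0; (0.0, 0.0) is the null control.
    """
    def g(t: float) -> float:
        return p0(t) + c1

    ts = [T * i / grid for i in range(grid + 1)]
    # Switching epochs are the sign changes of p0(t) + c1 on the grid.
    roots = [bisect_root(g, a, b)
             for a, b in zip(ts, ts[1:]) if (g(a) < 0) != (g(b) < 0)]
    if g(0.0) < 0:                      # active from t = 0: single switch
        return 0.0, (roots[0] if roots else T)
    if len(roots) >= 2:                 # two switches around the minimum
        return roots[0], roots[1]
    return 0.0, 0.0                     # no switch: null control


if __name__ == "__main__":
    for c1 in (0.5, 1.0, 2.0):
        print(c1, threshold_control(toy_costate, c1, 3.0))
```

With this toy costate and T = 3, the three values of c1 in the usage lines fall in the single-switch, two-switch and no-switch cases, respectively.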

P ROOF OF L EMMA . 4 vi. S = {m, M } with p0 (0) > c1 with m > c1 implies
Proof: It is possible to write the derivative of the mul- a two-switch control with ton > 0 and 0 < ton < toff < T .
tiplier p0 (t) in a convenient form. For notations sake, we This concludes the proof, since in all cases the optimal bang-
denote p0 (t) the expression of p0 (t) when c2 = 0, and tm bang control is a threshold policy.
the point (on the real line) where the minimum of p0 (t) is
attained. We hence obtain P ROOF OF T HM . 3
p0 (t) = p0 (t) c2 de(+)(T t) (14)
Proof: In this proof we need to make the dependence on
where we know that p0 (tmin ) = 0, p0 (t) < 0 for t < tmin and explicit in the notation: e.g., u is the optimal control when
p0 (t) > 0 for t > tmin . multiplier is adopted in the relaxed objective function J (u).
However, p0 (T ) = 0 and p0 (t) = c2 d < 0, so that i. The fact that pair (u , ) minimizing J (u) is unique
there exists a whole left neighborhood of T where p0 (t) > 0 follows from the expression J (u) = J(u)+(nXd(t)). Let
and decreasing. And, p0 (t) < 0 for t < tmin . assume by contradiction another pair (u, ) is optimal, then it
By taking into account the sign of p0 and the additional must hold J(u ) = J(u). However, this implies that the two
negative term appearing in (14), it is immediate to conclude threshold policies must be identical, i.e., u = u, and so also
that only the following three cases are possible: = , because of the linear dependence with multiplier in
i S = : in this case p0 (t) is strictly decreasing in [0, T ]; (3).
ii S = {M } and the maximum is attained at 0 < tM < ii. The fact that the relaxed problem solves for the optimal
T : in this case p0 (t) strictly increasing in [0, tM ] and solution of the original constrained minimization follows from
decreasing in [tM , T ]; the following argument. Let define Ufn = { u| Xd(T ) =
iii S = {m, M } otherwise, where M is attained at 0 < n, Xd (t) d, t [0, T ]} U, let be the optimal
tM < T and m is attained at 0 < tm < tM < T ; i.e., multiplier and u the optimal solution of the constrained
in this case p0 (t) is decreasing in [0, tm ], increasing in problem.
[tm , tM ] and then decreasing in [tM , T ];
J(u ) = minn J(u) = minn J(u) + (n Xd (T )) = J (u )
which concludes the proof. uUf uUf

P ROOF OF T HM . 2 where the equality follows from the fact that (nXd(u)) = 0
over set Ufn .
Proof: From Lemma. 4, the structure of the control can be
iii. The correctness of the bisection search is due to the fact
analyzed exhaustively counting the possible switches induced
that J (u ) is indeed monotone in . In fact, costate variable
by the dynamics of p0 (t), similarly to what has been done in
Thm. 1: p (t) = Fe(t) + G(t)
0
i. S = implies the null control, i.e., u 0, i.e., ton = where we have made explicit the dependence on appearing
toff = 0; in (3). Now, with respect to switching epoch toff , let us consider
ii. S = {M } and p0 (0) c1 implies the null control; multiplier + , for some > 0. Then we can write
iii. S = {M } and p0 (0) < c1 implies a single switch
p+ (toff ) = Fe (toff ) + G(toff ) F (toff ) < 0
0
control with ton = 0 and 0 < toff < T ;
iv. S = {m, M } and p0 (0) > c1 with m > c1 implies which implies toff < t+ off . Opposite holds for ton : ton >
+
the null control; ton . From direct inspection of the cost function, it follows
v S = {m, M } with p0 (0) < c1 implies a single-switch J (u ) < J+ (u+ ), which proves the claimed monotony
control with ton = 0 and 0 < toff < T ; argument.
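Item iii of the proof of Thm. 3 justifies a bisection search on the multiplier by a monotonicity argument. A minimal sketch of such a search is given below. It assumes a user-supplied map `terminal_state(nu)` returning the value of Xd(T) under the optimal threshold control for multiplier `nu`, and it assumes that this map is continuous and nondecreasing. The name `nu`, the function names and the toy map in the usage lines are illustrative assumptions and do not reproduce the model of the paper.

```python
import math
from typing import Callable


def bisect_multiplier(terminal_state: Callable[[float], float], target: float,
                      lo: float, hi: float, tol: float = 1e-9,
                      max_iter: int = 200) -> float:
    """Find nu in [lo, hi] with terminal_state(nu) = target by bisection.

    Assumes terminal_state is continuous and nondecreasing on [lo, hi].
    """
    if terminal_state(lo) > target or terminal_state(hi) < target:
        raise ValueError("target is not bracketed by [lo, hi]")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if terminal_state(mid) < target:
            lo = mid   # constraint not yet met: increase the multiplier
        else:
            hi = mid   # constraint met or exceeded: decrease the multiplier
        if hi - lo <= tol:
            break
    return 0.5 * (lo + hi)


def toy_terminal_state(nu: float) -> float:
    """Hypothetical monotone map from the multiplier to Xd(T)."""
    return 30.0 + 20.0 * (1.0 - math.exp(-nu / 10.0))


if __name__ == "__main__":
    nu_star = bisect_multiplier(toy_terminal_state, target=40.0, lo=0.0, hi=100.0)
    print(nu_star, toy_terminal_state(nu_star))
```

The bracket check at the start makes an infeasible target fail loudly instead of returning an endpoint silently.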
