Course Instructors: Dick Epema, Distributed Systems Group, EEMCS, TU Delft, Email: D.H.J.Epema@tudelft.nl
Lab Assistant: Apourva Parthasarathy, Email: a.parthasarathy@student.tudelft.nl

Abstract—In this report, we present CTC-Swarm, a cloud resource manager that runs multiple data retrieval tasks in preparation for automated crypto-currency trading. The software automatically deploys the tasks on the cloud, monitors them, and restarts them in case of errors. Two provisioning policies are introduced, and tests show that choosing the optimal policy requires a trade-off between lower costs and higher execution speed. We also manage to lower the downtime of streaming services to one minute in case of error, by constantly observing the pressure of the incoming data.

Keywords—IaaS, Cloud, Cluster Management

I. INTRODUCTION

Cloud services have caused a revolution in the dynamics of the IT industry by providing scalable and virtualizable resources. Infrastructure as a Service (IaaS) is one of the three types of cloud service models; it supplies its users with access to virtual computing resources such as server space or hardware [1]. However, the user of such an infrastructure often does not receive support in deploying and managing his application. In this report, an application is introduced that orchestrates and manages the extraction of large amounts of data from the internet. This data is used as input for an automated trading algorithm for crypto-currencies. In such trading applications, it is crucial to be able to obtain data as soon as it is available and employ it. Because of this requirement, the resource usage of our application can vary over time.

To cope efficiently with these peaks in demand, this report presents CTC-Swarm, an auto-scaling resource management system that schedules and scales computing resources based on demand. The software utilizes EC2 computing instances from Amazon Web Services (AWS) in order to run several data extraction and data mining applications. Further, it runs scripts packed into Docker containers in an automated and elastic fashion and allows monitoring and scheduling of the jobs.

The report is structured as follows. We start with an overview of CTC-Swarm's main features and capabilities in section II. We then describe our service level objectives in section III. In section IV the architecture of our software is presented. In section V we present the implemented features and characteristics of CTC-Swarm. Further, in section VI the experimental setup is presented together with the evaluation of specific features of the system. We end our report with section VII, containing our conclusions.

II. BACKGROUND ON CTC-SWARM

CTC-Swarm is an application that can be used for many purposes. However, in this report we focus on the data collection tasks, especially online information related to the trading of crypto-currencies. This information will be fed to an automated trading application developed by the WantCloud company. Therefore, the purpose of CTC-Swarm is to manage 3 data retrieval tasks which are provided by the company:

1) News extraction from websites: the task downloads a provided RSS feed with an arbitrary number of elements, specifying the most recent crypto-currency related articles, their addresses, titles and release dates. For each article, a dedicated crawler extracts the content and stores this information in the database. Once all the provided URLs are crawled, the program terminates.
2) Tweet streaming: this task consists of streaming, processing and storing tweets that contain certain pre-defined keywords. From each object, the program extracts the text of the body, converts it and performs the pre-processing. Finally, each tweet is stored in the database.
3) Trade information processing: this program continuously streams the current data regarding trades executed on one of the online crypto-currency markets and summarizes them into one-minute windows of the data. Therefore, every minute thousands of trades are loaded, accumulated, and stored in the database.

CTC-Swarm is responsible for the execution of these jobs and needs to monitor and optimize the resource usage of the application.

III. SERVICE LEVEL OBJECTIVES

It is crucial for CTC-Swarm to ensure that any employed finite task (i.e. the news extraction) is executed successfully and any continuous streaming task (e.g. the tweet streaming) does not break for too long, even if an error occurs. We define Service Level Objectives (SLOs) for our application to make this requirement measurable. For the finite tasks, the SLO is to successfully finish any task scheduled or delegated by a
user. In case of an error, the task needs to be restarted and finished. When it comes to the streaming jobs, the SLO of our application is to stream the data continuously; whenever a break occurs, it may not be longer than one minute. This means that if, for example, a Tweet streaming task crashes, our application needs to restart it in less than a minute. Finally, an SLO violation is any violation of the two previously defined objectives. Violations are counted by summing the number of finite tasks that did not execute successfully and the number of minutes in which the streaming tasks did not return any data.

IV. SYSTEM DESIGN

Our application uses a micro-service architecture. Each of the 3 tasks is designed as a standalone dockerized application and in our case deployed on an EC2 computing instance from AWS. Multiple applications can, therefore, be deployed on one instance.

A. Interaction

Figure 1 contains a high-level overview of the architecture of our application. There are three types of initiators which start an interaction with the underlying components:

1) Users interact with the application through a RESTful API (see description in IV-C) whose requests are handled by the Request Handler. This class translates user requests for the Manager, which manages the task execution (see IV-B). In case the user wishes to schedule future tasks upon request, the translated task is sent to the task queue S together with the UTC timestamp of its preferred execution.
2) The Scheduling thread not only plans future tasks but is also responsible for maintaining the company's SLOs (III), by querying the StatisticsManager for past container utilization, CPU utilization and application-specific information such as trade, news and tweet pressure.
3) The Monitoring thread continuously observes the utilization of infrastructure and applications and reports to the internal database. Specifically, it uses the AWSManager to retrieve CPU utilization, the DockerManager to observe the number of deployed containers per instance and the StatisticsManager to report the pressure of trades, tweets and news per minute and per hour.

B. Components

The Manager class serves as an intermediary for handling interactions with both AWS and Docker. Its other duty is to provide a task queue T, where provisioning tasks can be managed using the interfaces addTask and removeTask. We define a Task to be a tuple consisting of a service (described within the DockerManager) and an operation (either start, stop or restart): Task: (Service, Operation). The reason for handling tasks on the manager level instead of the Docker level is that services can run on an arbitrary AWS instance, depending on the provisioning policy. Hence, the manager holds a dependency on both classes. Furthermore, the Manager provides the interface run(policy), allowing the user to define which policy (see V-A) the manager should follow while processing the tasks from the queue.

The AWSManager class handles the resources we demand from Amazon Web Services (AWS), such as RDS databases, EC2 instances and their utilization using CloudWatch. For this we use the API wrapper library boto3 [2]. Next to the obvious functionality that lists currently running resources, more complex operations are implemented as well. For example, launchEC2Instance(name) and launchRDSInstance serve the purpose of provisioning a resource, EC2 or RDS respectively, and ensure that the resource is started correctly. As those functions are time-consuming, and hence blocking, we used Python's asyncio library in order to work in an asynchronous fashion. Once the Future yields the expected result, the instances are ready for services to be deployed on.

The DockerManager is responsible for handling container (i.e. service) related concerns. We rely on the Docker API wrapper library docker-py [3]. This includes the deployment of our images ctc-news, ctc-trades-watch and ctc-twitter with the functions deployNews, deployTrades and deployTwitter. In order to deploy any of the images, the DockerManager needs to connect to a Docker installation on one of the deployed instances on AWS. This can be done in two ways. The simple approach is to connect to the machine which is passed directly to the class using the function setClientByInstance(instance). However, there are cases where we have multiple instances running and wish to connect to the most suitable one based on some strategy. Therefore, the function setClientByLowestContainers(instances) allows connecting to the instance which currently maintains the fewest running Docker containers.

Once services are deployed and running, the StatisticsManager serves essentially two purposes: reporting and retrieving pressure for services. Reporting pressure is handled by the monitoring thread (running on the server), which connects to the Results-Database, determines the pressure for trades, tweets, or news, and writes the observed pressure into our Application-Database. We define the table that reports pressure for an arbitrary service s as follows:

pressure(s): id, timestamp, minute, hour, option

The ids and timestamps are incrementing; the minute and hour fields describe the pressure for the last minute or hour; and the option field allows defining e.g. the instance (for CPU pressure) or the market (for trade pressure) that was observed. Retrieving the pressure is required in the scheduler and simply reads out the previously reported pressure of a service.

The Scheduler class is an extension used by the Scheduling thread and serves a queue S that keeps track of all tasks that need to be executed in the future. The function addTask(task, timestamp) therefore accepts the task to be executed and the timestamp of its execution. The task again is a tuple of a service and its operation, where the service is in our case a Docker container. In addition, we also provide capabilities for accepting cron tasks, by specifying an interval instead of a timestamp: addTask(task, cron).
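The scheduling behavior described above can be sketched in a few lines. This is a minimal sketch, not the actual CTC-Swarm code: the class and method names follow the report, but the heap-based internals and the dueTasks helper are assumptions, and the cron variant is omitted.

```python
import heapq
import time

# Minimal sketch of the Scheduler's queue S: a task is a
# (service, operation) tuple, kept in a heap ordered by the
# UTC timestamp of its preferred execution.
class Scheduler:
    def __init__(self):
        self._queue = []      # heap of (timestamp, counter, task)
        self._counter = 0     # tie-breaker for equal timestamps

    def addTask(self, task, timestamp):
        heapq.heappush(self._queue, (timestamp, self._counter, task))
        self._counter += 1

    def dueTasks(self, now=None):
        """Pop and return all tasks whose timestamp has passed."""
        now = time.time() if now is None else now
        due = []
        while self._queue and self._queue[0][0] <= now:
            _, _, task = heapq.heappop(self._queue)
            due.append(task)
        return due

s = Scheduler()
s.addTask(("ctc-news", "start"), timestamp=100)
s.addTask(("ctc-twitter", "restart"), timestamp=200)
print(s.dueTasks(now=150))  # → [('ctc-news', 'start')]
```

The counter tie-breaker keeps the heap comparison from ever reaching the task tuples themselves, so tasks scheduled for the same second are released in insertion order.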
C. REST API

As mentioned in IV-A, CTC-Swarm exposes a RESTful API such that users can interact with the infrastructure. The endpoints can be described as follows:

POST /manager/strategy: name, args
POST /task: service, operation
DELETE /task: service, operation
DELETE /allTasks
POST /schedule/task: service, operation

As one can see from the naming of the endpoints, the API serves a subset of the capabilities provided by the components described in IV-B. The RequestHandler therefore translates the requests and calls the appropriate function in the specific class.

V. IMPLEMENTED FEATURES

A. Automation

The automation part of this project involved the handling of RDS instances, EC2 instances and Docker containers, laying out a foundation for other features to rely on.

The AWSManager provides interfaces to communicate tasks concerning a resource provided by Amazon AWS. The full class documentation is listed in Appendix -A. As one can see from the function descriptions, the functionality is mostly concerned with launching, reading and terminating instances.

The DockerManager provides interfaces to handle containers (to be) deployed on EC2 instances. Much of the functionality we use is already provided by the API; however, our project involves, for example, the deployment of containers which require extensive configuration at deployment. Therefore, Appendix section -B describes the interfaces we developed.

Whereas the previous two classes provide infrastructure functionality, the class StatisticsManager is targeted more at observing and reporting application-level behavior. As can be seen from Appendix section -C, for every type of service there is a reporting function as well as a getter.

B. Elasticity / Load-balancing

When it comes to elasticity, the Manager takes care of booting new instances whenever necessary and closing them whenever they are idle and there are no jobs to be executed. The strategy the manager uses in order to launch instances and containers is what we call the Provisioning Policy. Depending on the policy, the Manager manages the VMs differently. We set a maximum of 10 VMs that can run at the same time, preventing our application from unexpectedly becoming more expensive than anticipated. The evaluation of our tests in section VI shows that this number is sufficient. Two main policies manage the load balancing: one based on a threshold of containers per VM, the other based on a utilization threshold of the VMs. These are discussed in more detail in the upcoming paragraphs.

a) Maximum number of tasks run on a single instance: The first policy takes as a parameter the maximum number of containers to be run on a single VM. Once the Manager runs with this policy, it boots as many instances as it expects would be the minimum, given the threshold, to run the number of tasks in the queue. Then, in each iteration, it first checks whether any of the VMs can be closed or any new ones need to be booted, based on the current number of running tasks and the queue size. Next, it tries to run as many tasks as possible on the instances with the lowest number of running containers. However, the main limitation of this policy is that it does not take into consideration how resource-demanding some applications may be. This problem is addressed by the second policy.

b) Utilization threshold of the VMs: The second policy takes two parameters as input. It requires the maximum utilization threshold, which specifies up to what level of utilization of a VM the manager will assign tasks to it, and the initial instance allocation, which is the user's guess of how many containers an instance can handle at the same time. The second parameter is used only when many tasks are added to the queue at the same time, making the manager boot multiple instances at once instead of booting one at a time and waiting for it to be fully utilized. The policy performs similar steps in each iteration; the only modifications are that it additionally boots a new instance if all instances reach the utilization limit, and that it assigns tasks to the instances with the lowest utilization. The policy improves upon the previous one in terms of monitoring the CPU usage of the instances and possible cost savings.

These two policies are evaluated in the next section for multiple parameter sets.

C. Reliability

CTC-Swarm implements the reliability feature by making sure that the streaming tasks are down for as short a time as possible and the finite tasks are quickly restarted in case of an error. This is done on two levels.

1) Firstly, the Manager class does this at the container level. In each iteration it checks whether any of the running Docker containers has the failed status, meaning that there was a task error. Such containers are restarted directly.
2) Secondly, the Scheduler does this in a streaming-application-specific way, by monitoring the rate of incoming trades; in case of a drastic decrease, it warns the Manager. The Manager then checks the status of the container again, and if Docker indicates that it is running correctly and it has not just been restarted by the Manager, the restart process is invoked. Thanks to this, we make sure that the task gets restarted even when an error did not cause the container to fail but negatively influenced the streaming.

D. Monitoring

The monitoring feature is implemented in CTC-Swarm on the application-specific level for the streaming tasks. The system monitors the number of tweets and trades for different markets streamed per minute and stores them in a separate database. Additionally, this allows for detecting whether some of the streaming tasks have broken, so that the manager can restart these tasks if the rate of streamed data is too low.
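The restart decision described in the reliability and monitoring subsections can be sketched as follows. This is a hypothetical sketch, not the actual implementation: the "drastic decrease" criterion, the factor, and the grace period are assumptions; the report only specifies that a rate drop triggers a warning and that just-restarted or already-failed containers are not restarted again.

```python
# Sketch of the restart decision: the Scheduler flags a service when
# its per-minute pressure drops drastically; the Manager restarts it
# unless Docker already reports a failure (handled separately by the
# Manager loop) or the container was only just restarted.
def pressure_dropped(history, factor=0.2):
    """True if the latest per-minute pressure fell below `factor`
    times the average of the preceding observations."""
    if len(history) < 2:
        return False
    *previous, latest = history
    baseline = sum(previous) / len(previous)
    return baseline > 0 and latest < factor * baseline

def should_restart(history, container_status, seconds_since_restart,
                   grace_period=60):
    if container_status == "failed":
        return False          # the failed-container path restarts it
    if seconds_since_restart < grace_period:
        return False          # avoid restart loops
    return pressure_dropped(history)

# A healthy trade stream that suddenly collapses:
print(should_restart([900, 1100, 1000, 40], "running", 300))  # → True
```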
Fig. 1. CTC-Swarm Architecture Overview
1) Stopping the monitoring of a specific service, which leads to an immediate pressure drop.
2) Then, the longest time between two consecutive tweets or trades is measured, which, in comparison to general benchmarks, may indicate how long it took for CTC-Swarm to automatically restart the instance and how many tweets or trades were lost in that time.

TABLE IV. EVALUATION OF DOWNTIME (DT) AND NUMBER OF VIOLATIONS (V)

Service            | Pressure Window | DT: observe every 1min | DT: observe every 2min | DT: observe every 5min
Trades (GDAX)      | 3min            | ~180s / 3V             | ~210s / 4V             | 300s / 5V
Trades (Bitfinex)  | 1min            | ~75s / 2V              | 120s / 2V              | 300s / 5V
Trades (Bitstamp)  | 1min            | ~90s / 2V              | 120s / 3V              | 300s / 5V
Tweets             | 5min            | ~348s / 6V             | ~354s / 6V             | ~445s / 8V
VII. CONCLUSION
In this report, we designed and evaluated the cloud resource
manager CTC-Swarm. CTC-Swarm is able to orchestrate the
execution and the scheduling of 3 different types of data
extraction tasks. The application is also able to scale in
resources whenever necessary. We described the architecture
and implemented several features.
We evaluated CTC-Swarm with 3 types of tests: benchmark
tests, tests on different provisioning policies and reliability
tests. The tests show that, when choosing a provisioning policy, a trade-off must be made between increasing performance and lowering costs. The tests also show that, in order to minimize SLO violations, the pressure window of the streaming services needs to be chosen based on the different pressure loads of the streamed services.
While developing a cloud-based application, we came to experience that it is very difficult to think of all the edge cases. For example, the simple task of starting an instance is not that trivial, considering that the amount of time until the instance is fully ready for containers to be deployed on can differ a lot. We also noticed that the computing power provided by our free-tier AWS instances can vary. We addressed this by averaging over repeated experiments and by introducing extensive error handling within the code.
REFERENCES
[1] S. Goyal, “Software as a Service, Platform as a Service, Infrastructure as a Service: A Review,” International Journal of Computer Science Network Solutions, vol. 1, no. 3, pp. 53–67, 2013.
[2] “Boto 3 documentation.” [Online]. Available:
https://boto3.readthedocs.io/en/latest/
[3] “docker-py.” [Online]. Available: https://github.com/docker/docker-py
CLASS DOCUMENTATION

A. AWSManager

NAME
    aws_manager

CLASSES
    builtins.object
        AWSManager

class AWSManager(builtins.object)
 |  Methods defined here:
 |
 |  __init__(self)
 |      Initialize self. See help(type(self)) for accurate signature.
 |
 |  deleteRDSInstance(self)
 |
 |  getAllInstances(self)
 |
 |  getAllInstancesReady(self)
 |
 |  getInstanceById(self, id)
 |
 |  getInstanceUtilization(self, instanceId)
 |      Determines latest known average CPU utilization using CloudWatch.
 |
 |      Returns only the latest known observed timestamp.
 |
 |  getInstancesById(self, ids)
 |
 |  getRDSInstance(self)
 |
 |  isInstanceReady(self, instanceId)
 |
 |  launchInstance(self, name)
 |
 |  provisionInstance(self, instanceType='t2.micro')
 |
 |  provisionRDS(self, instanceClass='db.t2.micro', user, password)
 |
 |  terminateAllInstances(self)
 |
 |  terminateInstance(self, instance)
 |
 |  terminateInstances(self, ids)
B. DockerManager

NAME
    docker_manager

CLASSES
    builtins.object
        DockerManager

class DockerManager(builtins.object)
 |  Methods defined here:
 |
 |  __init__(self)
 |      Initialize self. See help(type(self)) for accurate signature.
 |
 |  close_when_empty(self, instance)
 |
 |  deployCandleProcessor(self, containerName)
 |      Available containers:
 |          ctc-trade-candles-bitstamp-1m
 |          ctc-trade-candles-gdax-1m
 |          ctc-trade-candles-bitfinex-1m
 |          ctc-trade-candles-bitfinex-15m-shifted
 |          ctc-trade-candles-gdax-15m-shifted
 |          ctc-trade-candles-bitstamp-15m-shifted
 |          ctc-trade-candles-gdax-60m-shifted
 |
 |  deployNews(self, serviceName='ctc-news')
 |
 |  deployTrades(self)
 |
 |  deployTwitter(self)
 |
 |  getAllRunningContainers(self, instances)
 |      Lists running containers for a list of instances.
 |
 |      A zip list in the form of: [(instance, [containers])] is returned.
 |
 |  getClient(self)
 |
 |  getClientByInstance(self, instance)
 |
 |  getImageName(self, container, version=False)
 |
 |  getNumOfRunningContainers(self, instance)
 |
 |  getRunningContainers(self, instance)
 |      Lists the running containers for an instance.
 |
 |  notifyWhenLessThanTargetNumOfContainers(self, instance, target)
 |
 |  restartContainer(self, container)
 |
 |  setClient(self, client)
 |
 |  setClientByInstance(self, instance)
 |
 |  setClientByLowestContainers(self, instances)
 |
 |  toEnviron(self, l)
 |      Converts and resolves a list of environment variables to a docker
 |      environment entry.
 |      E.g. ['CTC_DB'] results in: ['CTC_DB=value']
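The selection strategy behind setClientByLowestContainers (described in IV-B) reduces to picking the instance with the fewest running containers. A minimal sketch with stubbed container counts; in CTC-Swarm the counts would come from getNumOfRunningContainers, and the helper name here is hypothetical:

```python
# Pick the instance with the fewest running Docker containers.
def pick_lowest_containers(instances, num_containers):
    """instances: list of instance ids;
    num_containers: callable returning the container count for an id."""
    return min(instances, key=num_containers)

# Stubbed counts standing in for getNumOfRunningContainers:
counts = {"i-1": 4, "i-2": 1, "i-3": 3}
target = pick_lowest_containers(["i-1", "i-2", "i-3"], counts.get)
print(target)  # → i-2
```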
C. StatisticsManager

NAME
    statistics_manager

CLASSES
    builtins.object
        StatisticsManager

DATA
    db = <peewee.MySQLDatabase object>
    news_db = <peewee.MySQLDatabase object>
    trade_db = <peewee.MySQLDatabase object>
    twitter_db = <peewee.MySQLDatabase object>
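The pressure table that the StatisticsManager reports to was defined in IV-B as pressure(s): id, timestamp, minute, hour, option. For illustration only, the schema can be sketched with the standard-library sqlite3 module (CTC-Swarm itself uses peewee with MySQL, as the DATA section above shows; column semantics follow IV-B):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pressure (
        id        INTEGER PRIMARY KEY AUTOINCREMENT,
        timestamp INTEGER NOT NULL,
        minute    INTEGER NOT NULL,   -- pressure over the last minute
        hour      INTEGER NOT NULL,   -- pressure over the last hour
        option    TEXT                -- e.g. instance id or market
    )
""")
# The monitoring thread would insert one row per observation:
conn.execute("INSERT INTO pressure (timestamp, minute, hour, option) "
             "VALUES (?, ?, ?, ?)", (1514764800, 118, 7042, "GDAX"))
# The scheduler simply reads back the latest reported pressure:
row = conn.execute("SELECT minute, hour, option FROM pressure "
                   "ORDER BY id DESC LIMIT 1").fetchone()
print(row)  # → (118, 7042, 'GDAX')
```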
TIME SHEET
D. Report and Code
• the total-time = 200h
• think-time = 50h
• dev-time = 60h
• xp-time = 40h
• analysis-time = 10h
• write-time = 20h
• wasted-time = 20h
E. Experiments
1) Experiment 1:
• total-time = 5
• dev-time = 4
• setup-time = 1
2) Experiment 2:
• total-time = 15
• dev-time = 10
• setup-time = 5
3) Experiment 3:
• total-time = 15
• dev-time = 10
• setup-time = 5