
Summary of Kudyba (2014) Big Data, Mining, and Analytics: Components of Strategic Decision Making. Boca Raton: CRC Press/Taylor & Francis; 305 p.



Prepared by Ariadna73
Foreword by Thomas Davenport
Big Data gives competitive advantage
Appeal of Big Data: Combining diverse data types and formats.
Marketing creation and analytics is a big part of Big Data
Huge advances in healthcare at lower costs
Big Data is being combined with traditional forms of analytics
With this book the reader will learn about all the technologies needed to establish a platform for processing Big Data
1. Introduction to Big data era (Stephan Kudyba and Mathew Kwatinetz)
Description of Big Data
Industries that use BD
Marketing
- Elections: using data with media to campaign
- Investment and media
- Commerce and loyalty data
Healthcare
- Predict the spread of disease
- Descriptive power and predictive pattern matching
Real estate
- Building energy disclosures
- Smart meters
Transportation (GPS)
- Intelligent transport application
- Crowd-source crime fighting: apps to upload crime-related information
- Pedestrian traffic patterns in retail
Energy (metrics)
Retail
- Estimate sales
- Forecast the price of online tickets
Sensors in various products
- Cell phones
- Fitness devices
- Even appliances!
Sources are numerous
Structured
Unstructured
- website links
- Product reviews
- Text
- Pictures/Images
- Tweets
- Emails
Velocities (How quickly is data being generated, communicated and stored)
Real time: high-velocity or fast moving data
Source of more descriptive variables
Building blocks to decision support
Unless data can help decision makers, there is little value to it
Leveraging data through analytics: The most important thing to know is which questions BD is going to answer
Impact trend: ability to leverage data variables that describe activities/ processes
The value of data: BD enhances the decision-making process
Value is found after data is normalized, calculated, or categorized
Ethical considerations in the Big Data era: Users must be aware at all times of their data being collected
2. Information creation through analytics (Stephan Kudyba)
End result of Analytics (Basic concept): Extract / generate information to provide a resource to enhance the
decision-making process
Business Intelligence
Query and report creation
- Spreadsheets
Online Analytic Processing
- Provide multidimensional view of an activity
- Also considers the source of the data as a variable
- Pivot tables: Leverage data in a flat file to present alternative scenarios
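A minimal sketch of the pivot-table idea in pandas (the flat file and its column names are invented for illustration):

```python
# Reshape a flat file into a multidimensional view: rows become one
# dimension, columns another, and the values are aggregated.
import pandas as pd

flat = pd.DataFrame({
    "region":  ["East", "East", "West", "West"],
    "product": ["A", "B", "A", "B"],
    "sales":   [100, 150, 80, 120],
})

view = flat.pivot_table(index="region", columns="product",
                        values="sales", aggfunc="sum")
print(view)
```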
Analytics at a glance through dashboards: Be careful and don't make them overwhelming
Multivariate analysis
- Robust BI and drill down behind dashboard views
- Regression
- Data mining apps
Neural networks
Clustering
Segmentation classification
Real time mining
Analysis of unstructured data: Text mining
Six sigma
Focus: Reduce variability in operations
Visualization
Simple graphics Types
Real time mining and big data
Analysis of unstructured data and combining Structured and unstructured sources
Complex Event Processing (CEP)
Event Stream Processing (ESP)
Data mining and the value of data
The first question to ask is what is in a data file?
Why things are happening
Correlations
Fraud detection
Risk assessment
Outcomes and other areas of healthcare
What is likely to happen
Value of data and analytics
Efficiency
Cost reduction
Productivity
Profitability

3. Big Data Analytics (BDA) -Architectures, implementation, methodologies and tools
(Wullianallur Raghupathi and Viju Raghupathi)
BDA is emerging as a sub-discipline of Business Analytics
BDA is being used to analyze and gain insight to make informed decisions
Primary characteristics
- Volume
- Velocity
- Variety
- Veracity (added by some practitioners)
Big data analytics is changing data analytics in companies
In the future it will be widespread
Key: utilization of distributed processing and new gathering and storing tools
Very large data sets have existed forever, but now we have better tools to store them
Also, social media Sources are new
This approach is unprecedented
Challenges
Typical limitations of Open Source
Lag between when the data is collected and processed
Web services delivery mechanisms need more development
Privacy, Safeguarding security, establishing Standards and governance
Architectures, Frameworks, and tools
The conceptual framework is similar to other business analytics projects, with some differences
1. Processing: with such a big data set, the processing must be broken down across multiple systems
2. Proliferation of open-source platforms such as Hadoop/MapReduce encourages the use of multiple
domains
3. The user interfaces are entirely different
Applied conceptual architecture
BD needs complex tools to clean the data and then apply any of those fancy software apps to its four typical
applications: OLAP, Reporting, Queries, and Data Mining
Hadoop (most popular)
- It is a NoSQL type of tool developed in Apache
Allocates portions of the data in different servers to be processed, and then integrates the results
Has stimulated other Apache developments such as
Zookeeper: coordination services
Open source that allows a centralized infrastructure with various services
Uses Java and C interfaces
Hbase
Column-oriented database management system that sits on top of HDFS (the Hadoop file system)
Name Node (Master): Manages the cluster in Hbase
Slave nodes
JobTracker
TaskTracker
Does not support SQL
Developed in Java. Supports Avro, REST, Thrift
Cassandra: Database
A distributed database system
Top-level project
2 million columns in a single row
Built on a distributed architecture named Dynamo
Does not have a master node so it can be more easily queried
Apache Avro: Data serialization
Facilitates data serialization services
Includes the schema in the data file
Supports versioning
Mahout: Machine-learning library
Chukwa: Monitoring system
Streams
Applies analytics techniques of data in motion
Sensors, voice, videos, financials
- Advantage: Graceful degradation or capability to cope with failures
Two important modules
HDFS: Hadoop Distributed File System: Only a chunk of the data lives in each machine, and there can be some replication too
Then there is the need for a technology for distributed analysis and aggregation of the results: MapReduce
It can also be used to clean the data
- The problem is that it is really complex and there are not a lot of experts
- Vendors
Open-source
AWS
Cloudera
MapR
Hortonworks
Proprietary
BigInsights (IBM)
Map Reduce
- Developed by Google
- Algorithm components: Map and Reduce (a minimal sketch appears after this list of tools)
Map: Map the broken tasks to the various locations
Reduce: Collect and aggregate results
- Mainly called from Java
- Programming languages have been developed. Here are some examples
Pig: High-level programming language for Hadoop
Pig Latin is the language itself
Then there is the runtime version where the language is coded and executed
Steps of a Pig program
Load the data
Series of manipulations to convert the data into a series of mapper and reducer tasks
Dumping the data to screens or storage
Advantage: Enables analysts to focus more on data analysis than on programming
Hive: Provides an SQL-like language to make queries
Has a shorter learning curve
But it is slower
Not appropriate to write very large programs
Jaql
It is a query language for JavaScript Object Notation (JSON), but it is more powerful
Enables JOIN, GROUP and FILTER
It is like a hybrid of Pig and Hive
Oozie
Streamlines the workflow and coordination among MapReduce tasks
Define jobs, define relationships between the jobs
Schedule execution
Lucene
- used for Text/Analytic searches
- Full text indexing and library search for use within a Java application
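A minimal single-machine sketch of the map/reduce pattern the tools above build on (plain Python; the input lines are invented):

```python
# Word count, the canonical MapReduce example: map emits (word, 1)
# pairs, a shuffle groups them by key, and reduce sums each group.
from collections import defaultdict

def map_phase(line):
    for word in line.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    return word, sum(counts)

lines = ["big data big tools", "data in motion"]

# Shuffle: group mapped pairs by key, as Hadoop does between the phases.
groups = defaultdict(list)
for line in lines:
    for word, one in map_phase(line):
        groups[word].append(one)

print(dict(reduce_phase(w, c) for w, c in groups.items()))
# {'big': 2, 'data': 2, 'tools': 1, 'in': 1, 'motion': 1}
```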
BDA Methodology
Stage 1: Concept design
Establish need.
Define problem
Why is this project important?
Stage 2: Proposal
Abstract: Overall methodology and implementation process
Introduction
- What problem?
- Why is it important?
- Why Big data approach?
Completeness: Is the concept design complete?
Correctness: technically sound?
Consistency: cohesive or choppy?
Communicability: Format? Understandable?
Background material
- Problem domain discussion
- Prior projects and research
Stage 3: Methodology
Hypothesis development
Data sources and collection
Variable selection
ETL and data transformation
Platform
Analytic techniques
Expected results
Policy implications
Scope and limitations
Future research
Implementation
- Develop conceptual architecture
Show and describe components
Show and describe big data analytics tools
- Execute steps in methodology
- Import data
- Perform analytics with the various tools
Word Count
Association
Classification
Clustering
- Gain insight from outputs
- Draw conclusions
- Derive policy implications
- Make informed decisions
Stage 4: Presentation and evaluation
Robustness
Variety of insight
Substance of the research question
Demonstration of big data analytics application
Some degree of integration among components
Sophistication and complexity of analysis
Examples
BDA in Healthcare: Two broad applications
Healthcare business and delivery side
- Veterans Administration examples
Healthcare information technology (HIT)
Electronic medical records (EMR)
- Improve quality and lower costs
Great potential in the Practice of medicine
- Evidence based
Kaiser Permanente in CA helped retire a harmful drug (Vioxx)
The National Institute of Health in the UK uses big data to investigate drugs
- Personalized medicine: Aid in diagnostic and treatment
Reduction in medical errors
Outcomes
BDA of cancer blogs
Project to extract data from blogs
- Objectives
To use hadoop and MapReduce
To develop a parsing algorithm
To develop a vocabulary and taxonomy of keywords
To build a prototype interface
To contribute to social media analysis
- Types of unstructured information
Blog topic
Disease treatment
Other information
What can we learn from blog postings?
- What are the most common issues
- Cancer types most discussed
- Therapies and treatments
- Which blogs and bloggers are relevant and correct?
- Major motivators for comments
- Emerging trends in symptoms, treatment, therapy
What are the phases and milestones?
- Phase 1: Collection of blog postings into a Derby application
- Phase 2: Configuration of the architecture
keywords
Associations
correlations
clusters
Taxonomy
- Phase 3: Analysis of the extracted information (for example, identify patterns)
- Phase 4: Development of taxonomy
- Phase 5: To test the mining model and develop user interface for deployment

4. Data mining methods and the rise of big data (Wayne Thompson)
Big Data
Exponential growth, availability and use of information
The goal of data mining is generalization
Example of application = telematics: Transferring data from anything like a machine, a vehicle, an appliance, etc.
Data mining methods
Classical data mining techniques
Statistical methods (Not strictly data mining)
- Proposes a model that may explain the relationship between the outcome of interest (dependent variable) and explanatory (independent) variables
- Multiple linear regression
Used for predicting
Supervised learning techniques: enable one to identify whether a set of input variables is useful for prediction.
- Logistic regression
A form of regression analysis in which the target (response) variable is categorical
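A hedged sketch of both regression methods above, using scikit-learn on synthetic data (all names and values are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # explanatory (independent) variables

# Multiple linear regression: predicts a continuous outcome of interest.
y_cont = 2.0 * X[:, 0] - X[:, 1] + rng.normal(size=200)
print(LinearRegression().fit(X, y_cont).coef_)

# Logistic regression: the target (response) variable is categorical.
y_cat = (X[:, 0] + X[:, 2] > 0).astype(int)
print(LogisticRegression().fit(X, y_cat).predict_proba(X[:2]))
```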
K-means clustering
- Methodology for modeling
Select K observations
Assign each observation to the cluster with the nearest mean
Recalculate the positions of the centroids
Repeat 2 and 3 until the centroids no longer change
- Find K partitions in the data in which each observation belongs to the cluster with the nearest mean
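The four steps above translate almost directly into code; a NumPy sketch on synthetic data (illustrative only, not from the chapter):

```python
import numpy as np

def k_means(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: select K observations as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Step 2: assign each observation to the cluster with the nearest mean.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recalculate the positions of the centroids.
        # (A production version would also handle empty clusters.)
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: repeat 2 and 3 until the centroids no longer change.
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = k_means(X, k=2)
print(centroids)
```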
Association analysis
- Identifies groups of products or services that tend to be purchased at the same time, or at different times by the same customer
- Phase of data mining = descriptive modeling
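A tiny descriptive-modeling sketch of the idea: count which products co-occur in the same baskets (the transactions are invented):

```python
from collections import Counter
from itertools import combinations

baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"milk", "eggs"},
    {"bread", "eggs"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support: share of baskets containing both items of the pair.
for pair, n in pair_counts.most_common(3):
    print(pair, "support =", n / len(baskets))
```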
Decision trees
- Segmentation of the data that is created applying a series of simple rules
- Sometimes the trees become very large
Machine learning: Focus on automation
Neural networks
- Mimic human brain
- Very complex
- Black boxes
Support vector machines
- Finding a hyperplane that best splits target values
- Good for classifying simple and complex models
Ensemble models
- Use multiple modeling methods such as neural networks or decision trees to fit separate models to the same training data set
Model comparison
- Important: use separate data partitions for training, validation, and testing

Text analytics
Knowledge engineering = write rules based on syntax and grammar to extract the information
- Accessing the documents: TXT, PDF, CSV, Etc.
Learning from the collection: Text mining
- From text to data set
Use parts of speech
Result: a quantitative representation of the text
Dimension reduction with singular value decomposition: to shrink and simplify the very large data set
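A scikit-learn sketch of this text-to-data-set step followed by SVD (the documents are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "patient reports mild symptoms after treatment",
    "new treatment shows promising outcomes",
    "stock prices rise after earnings report",
]

# Quantitative representation of the text: one row per document.
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Singular value decomposition shrinks the very wide term matrix.
X_reduced = TruncatedSVD(n_components=2).fit_transform(X)
print(X_reduced.shape)   # (3, 2)
```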
- Exploratory models
Clustering: separate the documents into mutually exclusive sets
Topic modeling
Allow documents to be associated with many topics
Another approach: assume that the topics were generated with a process, and try to find that process's rules
- Predictive models
Once the text has been transformed into a data set, any approach can be taken
Also, text-specific methodologies can be applied, such as Inductive rule learning algorithm
Digging deeper with natural language: Knowledge engineering
- Used to extract important information from the documents
Combining approaches in the best Strategy: transform the list into a data set and also extract information

5. Data management and the model creation process of structured data for mining and
analytics (Stephan Kudyba)
Things to do before mining. Essential steps to managing data resources for mining and analytics (Think first
and analyze later)
Problem definition
Identifies the type of analytic project to be undertaken
- OLAP or basic dashboard
- Statistical/investigative analysis
- Multivariable data mining
Identifies data sufficiency issues
- Does essential data exist at the required amounts and formats to provide a solution?
Identify your Performance Metric
- Attain greater understanding of some activity at an organization
- The business problem will be a performance metric
- The key and most challenging part is to select the target variable
Formatting your data to match your analytic focus (This can take up to 80% of the time)
Acquire and analyze your data source
- Amounts of data
Is there enough data?
When working with categorical data, each category is actually an addition of another driver variable to a model.
Alleviate the issue of too much data
Take random samples
Use technologies to handle big volumes of data
Define the problem
- Cleansing your data from errors
Use software to visualize outliers
- Transformation of variables to provide more robust analysis
Make the variable relative, for example transform sales into percentage of market share
The transformation usually comes after the near-final data resource is ready to be mined

As sophisticated as mining methods may seem, they remain techniques that depend on their sources. So, we need a methodology for managing data resources
- Avoid comparing apples to oranges. The data must be
Relevant
Available
Accurate
In a format that can be mined
- Factors must be deliberated with analysts, data experts and clients
- Qualify what kind of project is this
Selecting driver variables that provide explanatory information
Selection of variables that can impact the target or performance metric
- With the emergence of new big data variables comes the complexity of identifying the data sources and strategies for merging with existing data
- Example of big data influence: GPS information can help insurers predict areas with more accidents
After identifying the target variable, define the level of detail of the data to be analyzed
Also need to consider the way the analysis will be structured
It's time to mine
Use available statistical metrics to interpret the validity and output of the analysis
Keep in mind working with the results rather than massaging the data
Interpreting the results (an input for decision making)
Remember that mining is done to provide actionable information for decision-makers
Do the model results produce reliable patterns?
Do the results make sense?
Adjusting the mining
Communicating the results to achieve true decision support
Effective transfer of information is when the receiver of the resource fully comprehends its meaning
The process is not finished yet (monitor the results in the real world)
Business Intelligence to the rescue
Create OLAP cubes to slice the data
Keep ongoing information about the Key Performance Indicators
Search for poor modeling techniques
Additional methods to achieve quality models
Advertising effectiveness and product type
- Example of solution: Normalize the data, add per-capita indicators
- Issue: To measure the effects over time
Summing up the factors to success (not just analysis)
- Incorporate a subject matter expert and a data professional

6. The Internet: A Source of new data for mining in marketing (Robert Young)
Marketing landscape and Internet's position
Share of ad spend / share of time
Internet is the second most widely used marketing channel
By 2015, TV will no longer be #1
70% of Americans use internet weekly
(DIAGRAM) Paid, owned, earned, shared
Increases in paid media impressions result in increased traffic to owned channels
Labor costs decrease
A new language for measuring is emerging
Returns and rates
Machine metrics combined with offline research can provide accurate models of consumption
Media mix dynamics
Click ROI: 5 Machine metrics
1. Search query volume
2. Ad server count
3. Capture the clicking as Part of the website analytics
4. Procrastinating clickers: Those that visit the site days after being exposed to the ad
5. Attribution modeling: Uses algorithms to track the online history
Nonclick ROI (media modeling)
Also called market mix modeling
Apply statistical analysis to historical time series data to unearth relationships between driver variables (ads) and target variables (sales), as in the sketch below
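A minimal sketch of that idea: ordinary least squares relating weekly sales to ad spend by channel (all numbers synthetic; real media models add lags, seasonality, and diminishing returns):

```python
import numpy as np

weeks = 52
rng = np.random.default_rng(1)
tv = rng.uniform(0, 100, weeks)        # weekly TV spend (driver variable)
online = rng.uniform(0, 100, weeks)    # weekly online spend (driver variable)
sales = 3.0 * tv + 1.5 * online + rng.normal(0, 20, weeks)  # target variable

# Ordinary least squares via lstsq, with an intercept column.
A = np.column_stack([np.ones(weeks), tv, online])
coef, *_ = np.linalg.lstsq(A, sales, rcond=None)
print("intercept, tv, online:", coef)
```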
Media modeling and Internet's new role
Internet response functions
Online channels are more popular but less effective over time
Sometimes mixing channels such as internet + tv, is better than each channel separately
Internet source of open data
Google Insights and Google Trends are tools to analyze the volume of searches
List of websites to analyze the Internet
Technorati: to analyze blogs
Alexa: Rankings
thedatahub.org
getthedata.org
Online's direct and indirect drive
The difficulty now is mixing online and offline channels

7. Mining and analytics in E-Commerce (Stephan Kudyba)
The evolving world of e-commerce
E-commerce is wider than web activities
Product likes in social media
Developing apps
Maintain active Facebook accounts
Sending texts
The analyst must decide what level of analytics is necessary.
Basic email reports
Number of emails sent
Emails delivered
Number of emails bounced
Analyzing Web metrics
Page Views
Time spent on page
Navigation route within a website
Click rates
Report generation
Interactive cubes are great
Google analytics comes in handy
From retrospective analytics to more prospective analytics with data mining
Defining E- commerce models for mining
The model can include offline strategies
Better understanding e-mail tactics
Try to not become another spammer
Analyze different variables to determine their effectiveness
Data resources from social media
The"likes" are a new metric
Mobile apps: An organization's dream for Customer Relationships
Apps are growing exponentially
Mining helps determine the effectiveness of an app.
Movie genre, movies, and online endeavors
Factors to consider in Web mining
Optimize the front-end web experience
Beginning the mining process
Analysts must look for repeatable patterns
Evolution in Web marketing
Mining provides insights on identifying visitors
The customer experience
The future entails knowing your customer better

8. Streaming data in the age of Big Data (Billie Anderson and Michael Hardin)
Introduction
Data collection is everywhere
Cell phones
Internet
Social media
The capacity to make decisions with immediacy is key to the operation of the most successful companies
Amazon is a good example for retail success
Other areas in non-retail businesses can be
- Find security breaches
- Disease outbreak
- Service warning
- Detect credit fraud
Big data comes in multiple types
- Traditional: customer transactional information
- Computer generated (weblogs, website navigation)
- Machine-generated data
- Social media data
- Multimedia data (video, voice, etc)
The big data cannot be stored and analyzed in the traditional way.
- An analyst must be able to explore all the types of raw, unstructured data
- This chapter will explain why streaming data needs "special" treatment
- Examples given
Healthcare
Marketing
- Examination of possible data applications for streaming
Streaming Data
The demand for analyzing streaming data comes from business decision makers and from people who want to improve lives
For example, a study in Toronto uses big data streaming to detect infections 24 hours sooner than in the past (Zikopoulos, P., Eaton, C., deRoos, D., Deutsch, T., and Lapis, G. 2012. Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. New York: McGraw-Hill)
In the past, enterprise data was stored in Data Warehouse systems
Problem: with the velocity of modern data, traditional analysis approaches are no longer appropriate
DW is more expensive because it uses disk space to store the data
In a streaming data app, the data flows quickly in the form of text, video or values
Data can come in trillions of bytes per second
Difference with DW: Data is stored in memory only for as long as it is needed for processing
Differences between DW and DSMS (DIAGRAMS)
Professionals required
Statistician
Domain expert
Computer analyst
Example of a methodology for developing a real-time analysis system
Los Alamos National Laboratory
Radio astronomy
- Goal: mitigation of noise or Radio Frequency Interference
Interesting because that way the telescope can concentrate on really interesting radio signals and save a lot of space (and money) by ignoring the man-made signals
RFI = Man-made radio signals
- The computational details are only a tiny part of the overall system. Network bandwidth and storage are key
Streaming data Case Studies
Healthcare: Influenza surveillance tracking method
Traditional
- Patient admitted
- Patient monitored while in the hospital
More contemporary approach:
- Links Google searches to healthcare officials from the Centers for Disease Control, using Internet searches to monitor what part of the population has influenza
- A group of researchers in Hong Kong has developed a digital dashboard to monitor influenza outbreaks
Marketing
Pay Per Click (PPC): Pay only if the advertisement is clicked
- First created by Goto.com
- Then transformed into Overture
- Then purchased by Yahoo
- Google has its own
The ads bid for which one will appear at the beginning
If an ad with a low bid gets more clicks, it gets a higher rank.
Detecting click fraud is crucial
- How to determine whether a click is fake or not?
- With big data the time window can be analyzed more precisely: an algorithm based on the "Bloom filter" determines whether a click is a duplicate over a jumping window of time (see the sketch below)
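An illustrative Python sketch of the duplicate-click idea (a hand-rolled Bloom filter; not the chapter's exact algorithm):

```python
import hashlib

class BloomFilter:
    """Space-efficient set membership with a small false-positive rate."""
    def __init__(self, size=10_000, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = bytearray(size)

    def _positions(self, item):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def seen(self, item):  # may rarely return a false positive, never a false negative
        return all(self.bits[p] for p in self._positions(item))

window = BloomFilter()
for click_id in ["u1:ad9", "u2:ad9", "u1:ad9"]:
    print(click_id, "duplicate?", window.seen(click_id))
    window.add(click_id)
# At each jumping-window boundary the filter is simply reset:
# window = BloomFilter()
```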
Credit card fraud detection
Costs: $8.6 billion annually
Method to detect fraud: CLUSTERING
- Unsupervised data mining technique: there is no outcome or target variable that is modeled using predictor variables
- Isolates transactions into homogeneous groups
- then keeps finding the centers for the groups in order to identify the outliers
- Tasoulis has created an algorithm to generate clusters from streaming data.
Main idea: cluster transactions in moving time windows
Includes a "forgetting factor" to disregard data when it is no longer useful
Two great potentials
Uncover fraud as soon as it occurs
Create a data stream analysis that is efficient
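A toy sketch of the forgetting-factor idea: an exponentially weighted running center that lets old transactions fade (the amounts and the 0.9 factor are invented):

```python
amounts = [25.0, 30.0, 27.0, 950.0, 26.0]   # streaming transaction amounts

forgetting = 0.9        # closer to 1.0 = longer memory of past data
center = amounts[0]     # initialize the cluster center on the first value

for amount in amounts[1:]:
    if abs(amount - center) > 5 * center:   # crude distance-based outlier test
        print("possible outlier:", amount)
    # Exponentially down-weight older observations as new ones arrive.
    center = forgetting * center + (1 - forgetting) * amount

print("final center:", round(center, 2))
```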
Streaming data is like a river
Data flows in and data flows out
The analyst only gets to see the data one time
Vendor products for streaming big data
Streaming data types are HIGHLY UNSTRUCTURED
Text logs
Twitter feeds
Transactional flows
Click streams
SAS
Offers end-to-end business solutions to 15 specific industries
- Casinos
- Education
- healthcare
- hotels
- Insurance
- Life sciences
- Manufacturing
- Retail
- utilities

December 2012: SAS launched the SAS DataFlux Event Stream Processing Engine
- Incorporates relational, procedural, and pattern-matching analysis of unstructured and structured data
- Ideal for analyzing high-velocity big data in real time
Targeted for the financial field
- Global banks must comply with Basel III, Dodd-Frank, and other regulations
- Includes data streamed from Thomson Reuters, NYSE Technologies, and International Data Corp
IBM
Currently on its third version of a streaming data platform
INFOSPHERE STREAMS
- High performance computing platform
- Allows developers to build and reuse applications that rapidly consume and analyze information, enabling users to continuously analyze petabytes per day
Perform complex analytics of heterogeneous data types
Text
Images
Audio
Voice
Video
Web traffic
e-mail
GPS
Financial transaction data
satellite data
Sensors
Leverage sub-millisecond latencies to react to unfolding events
Adapt to rapidly changing data forms and types
Easily visualize
Conclusion
Streaming big data adds a new component to data security
The number of individuals with access to streaming data is another concern
This is a rich area for ongoing research

9. Using CEP for real time data mining (Steven Barber)
Introduction: Complex Event Processing (CEP)
New category of software
Processes a stream of events, performs computations, and produces a stream of new events based on the results of the computations
CEP squeezes out what is interesting from a raw event, produces a new event and discards the garbage
It is also used with static data, mostly for testing
Often used to match already known patterns against arriving data and then act on the match
It is similar to other classic programming, except that it takes in very large numbers of events that are
relatively raw
Quantitative approaches to streaming Data analysis
The model is declarative rather than imperative
Data-flow oriented rather than control-flow oriented
Push oriented rather than pull oriented
It is the arrival of an event that triggers the CEP processing
CEP systems are very responsive, with minimal processing latency
Rather than loops and if-then-else, the operators are event and stream-oriented
Aggregate, filter, map and split
Rather than global variables, state is kept in relational tables and retrieved via streaming joins of arriving
events with the contents of the table
Streams are partitioned into windows over which operations are performed
Higher level of abstraction: does not deal with lower-level details
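A rough Python sketch of these stream operators (filter, map, windowed aggregate) chained over arriving events; real CEP engines are push-driven while Python generators are pull-driven, so this is only an analogy:

```python
from collections import deque

def window_mean(events, size=3):
    win = deque(maxlen=size)        # sliding window over the stream
    for e in events:
        win.append(e)
        yield sum(win) / len(win)   # aggregate over the current window

events = [5, 7, 200, 6, 4]                  # arriving measurements
filtered = (e for e in events if e < 100)   # filter operator
mapped = (e * 2 for e in filtered)          # map operator
for out in window_mean(mapped):             # windowed aggregate operator
    print(out)
```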
Event stream processing Evolution in recent history
There is always more data and it isn't getting any simpler
Recognizing which event is interesting and which is not is something that MUST be automated
Sometimes we have millions of events per second and must respond to them immediately
Processing model Evolution
The orthodox history of CEP: a long-needed switch from RDBMS-based real-time processing to event stream processing
- CEP has grown out of the relational database world
- The result code tends to look like SQL
- RDBMS: Store, then analyze
That has limitations when data arrives at millions of messages per second
But not every event needs to be stored by every process that receives it
Perhaps we don't even need to remember the event after using it
- CEP analyzes and then stores (if needed) or analyzes while Storing (Inbound processing model).
- Turning the database on its side
In RDBMs data stay and queries are transient
In CEP data is transient and queries are persistent
Another view: CEP is a message transformation engine that's a service attached to a distributed messaging bus
- Imagine CEP as if it were stream-oriented asynchronous computation attached to a distributed messaging bus.
- Events go in, get operated upon and events go out
- CEP is analogous to receiving a callback in an API
The wheel turns again: Big Data, web applications, and log file analysis
- The problem with log files is that they are slow to create and store
- Store (in log files) then analyze (with, for example, Hadoop)
There is another tool called Flume
But this approach is being replaced with newer systems such as Twitter Storm and Apache Kafka
Rise of inexpensive multicore processors and GPU-based processing
Advantages of CEP
Visual programming for communication with stakeholders
Visualize CEP as program flow diagrams
Emphasizes that events come in streams
Benefits
- The developer is able to plot out the flow of each event and Keep the flow of each component Visible
- The meaning of a visual flow-based presentation is clearer to stakeholders
Flip side: Potential for creating very confusing diagrams
CEP example: A simple application using StreamBase
EventFlow concepts and terminology
This is the language used by Visual programmers
It is entirely visual, no text, with arrows and boxes
Event = a discrete representation that something happened
- In StreamBase the events are represented in a data type called "tuple"
- Individual tuples are instances of a schema
- A tuple contains a single value for each field in its schema
Schema: has a name and a set of schema fields
- Schema field = name + data type
- For example, a field called quantity has a type of INT; a tuple would have a field called quantity and a value of 100
An EventFlow application file contains a set of components and arcs
- Input streams
- Output streams
- Operators
Perform Specific type of runtime action on the tuples
Operators have editable properties
Stream = a potentially infinite set of tuples
Sequence: strict ordering of events in the stream
- Input streams
- Output streams
- Arcs
Step by step through the MarketFeed Monitor App
Listens to a stream of securities market data, calculates throughput statistics, and publishes them to a statistics stream
Also detects drastic fall-offs and sends alerts
The input data is simple and the output data is more complex
Uses of CEP in industry and applications
Automated securities trading
Market Data Management
- NYSE: Data structure and normalization vendor
- OneTick: Tick data vendor
Signal Generation / Alpha Seeking algorithms
- Low and high level
- Needed to better follow the market
Execution algorithms
- Algorithm manager
- Order manager
- Risk and compliance manager
- Audit system
- Connection to upstream and downstream orders
Smart order routing
- Widely implemented
- Most common asset routed is equities
- Analyze real time market conditions
Real-time profit and loss
Transaction cost analysis

CEP in other industries
Intelligence and security
- Sense and respond to patterns that indicate threats
- Details on how the government is leveraging this
Intelligence and surveillance
Intrusion detection and network monitoring
Battlefield command and control
Multiplayer online gaming
- Interest management
- Real-time metrics
- Managing user-generated content
- Website monitoring
Retail and E-commerce transaction analysis
- In-store promotion
- Web site monitoring
- Fraud detection
Network and software systems monitoring
Bandwidth and quality of service monitoring
Effects of CEP
Decision making
- Proactive rather than reactive, shorter event time frames, immediate feedback
- The key is to be able to search and present the data as it arrives
Strategy
- From observing and discovering patterns to automating actions based on patterns
- Datawatch's Panopticon is a tool to visualize patterns
Operational processes
- Reduced time to market due to productivity increases
- Shortened development means ability to try out more new ideas
Summary
CEP provides real-time analytics for streaming data
CEP architectures typically do not keep the data
StreamBase is a trademark of TIBCO
10. Transforming unstructured data into useful information (Meta S. Brown)
Introduction: Unstructured data
"Unstructured" data is not really unstructured: It is not presented in the simpler form the analyst is used to
The dividing line between nature and data is the computer
Text analytics in a big data culture
Driving forces for text analytics
The majority of investors in text analytics have not seen any ROI yet
Application for text analytics: Analysis of open-ended questions
Difficulties in conducting valuable text analytics
Critics point out that the results are not perfect
Deriving meaning from language is not simple and often requires knowing the context
The goal of text analysis: Deriving structure from unstructured data
There are different definitions for the term "text analytics"
The process of analyzing unstructured text, extracting relevant information and then transforming that
information into structured information that can be leveraged in several ways
Analysis of data contained in natural language text. Application of text mining techniques to solve business problems
The process of deriving information from text sources
Conversion of unstructured text into structured data
The transformation process, in concept: Two major classes of text analytic methods
Statistical: Grounded in mathematics
Linguistic: uses the rules of a specific language
What can go wrong?
No ROI!
Imperfect results
Techniques and methodologies
- Entity extraction
Identification and labeling of specific structures such as names, places, or dates
The text is then surrounded by tags similar to HTML tags to identify pieces that are interesting
- Autoclassification
Mechanism used primarily to facilitate identification and retrieval of documents
Uses a hierarchy or taxonomy of topics
Specific type: Sentiment analysis
The categories reflect varying attitudes
Frequently applied to social media
Very inexact and frustrating
- Search
Automated process for on-demand retrieval of documents that are relevant to topics of interest
Documents are quickly indexed
Contents of scanned documents are organized into databases
Integrating unstructured data into predictive analytics: The most important part is to clarify the questions that will be asked of the text
Assessing the value of new information
What to do after the unstructured text has been transformed into structured data?
Use categorical and continuous variables for predictive modeling
New variables are created to use in modeling
The only new techniques needed are those that translate text into data structures
Using information to direct action
The information means something if and only if it is used
Profitable text analytics begin with a clear business case and well-defined, reasonable models
Examples of applications
- Application: Recommendation engines
Inputs: past transactions and demographics
For example: recommendations received when shopping online
- Application: Churn modeling
Used to identify customers at high risk of going with competitors
Try to identify unhappy customers using sentiment analysis
- Application: Chat monitoring
Surveillance for profanity or inappropriate activity
Alternative: live monitors but those are really expensive
Summary
Text analytics is a young and imperfect field
It is the hottest area of development in analytics and will remain active for decades to come
11. Mining big textual data (Ioannis Korkontzelos)
Introduction
The common bottleneck among all big data analysis methods is structuring
Most of the data available are not ready for direct processing, because they are associated with limited or
no metadata
Metadata: Information about the data
This article deals with textual data Sources
Overview of methods for structuring data
Examples and applications
Sources of unstructured textual data
Textual data sources aspects of classification (Properties orthogonal to each other)
Domain: Represents the degree to which specialized and technical vocabulary is used in it
- The senses of words depend on it
- A piece of text can belong to a specialized technical or scientific domain, or to the general domain
Language
- One of the few hard, general classification features of text
- Partitions a collection of items into non-overlapping item sets
Style
- From formal and scientific to colloquial and abbreviated
- Due to social media, a new style of elliptical, very condensed text has emerged
Patents
Agreements between government and the creator of an invention
Steps to structure a collection of patents
- Identify part of the documents that correspond to different aspects
- Look for the different domains in the text
- Recognize and separate tables and figures
Publications
Easy to match with any domain and application
Natural level of universal structuring
- Title
- Abstract
- Sections
Methods
Result
Conclusion
Contain keywords
Corporate WebPages
Difficult to mine
But companies are still interested in mining the competition
Blogs
Growing interest in analyzing
Easier to access
Social media
Challenges: Determine the domain
Identifying the style is difficult
Impractical due to immense size
News
Easy to process
Carefully written
Domains easily specified
Online Forums and Discussion Groups
Strictly domain-specific text
Come with their own search tools
Technical specifications documents
Lengthy domain-specific documents
Formal style
Excellent source of terminology
Newsgroups, Mailing lists, emails
Supported by a specialized Network News Transfer Protocol (NNTP)
Domain-specific contents
Emails are a valuable source of text
Legal documentation
Minutes of public meetings are digitized and indexed
Important in text processing as a large, domain-specific textual source for tasks such as
- Topic recognition
- Extracting legal terminology
If it comes with parallel translation, it is an excellent source of multilingual data

Wikipedia
Covers many technical domains
Has a search engine
Anyone can modify the contents and that's not always good.
Structuring textual data
Term recognition
Words or sequences of words that represent concepts of a specific domain
Useful for many things
- Recognizing neologisms
- Indexing using terms improves search performance
Approaches
- Linguistic
Morphological
Grammatical
Syntactical
- Dictionary based: Employ already available repositories
- Statistical: Application of various statistical tools with raw counts, frequencies, co-occurrences
- Hybrid
Example: Combination of part-of-speech filtering, then applying regular expressions to the extracted terms
Named entity recognition
Terms associated with a specific class of concepts
Very similar to the notion of terms, but named entities must be classified and terms need not be
Use ontologies as their background knowledge: domain-specific classifications of concepts into classes
Relation extraction
Relations between previously recognized named entities and methods for recognition
Is-a relation: "car is-a vehicle"
Interacts-with
Event extraction
Events are complex interactions of named entities
Different types and nature for different textual domains
- In the domain of "news" events are incidents or happenings
- In the domain of "biology" events are structured descriptions of biological processes
Can follow simpler or sophisticated methods
- Simpler : Recognize trigger words
- More sophisticated: Introduce iterations
Sentiment analysis
The task of assigning scores to textual pieces that represent the attitude of the writer
Can consider multiple scopes of the sentiment dimension
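A deliberately simple lexicon-based sketch of the task (the tiny word lists are invented; real systems use far richer lexicons and context):

```python
POSITIVE = {"good", "great", "effective", "promising"}
NEGATIVE = {"bad", "poor", "painful", "worse"}

def sentiment(text):
    """Score = positive-word count minus negative-word count."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment("The new treatment looks promising and effective"))  # 2
print(sentiment("Side effects were bad and got worse"))              # -2
```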
Applications
Web Analytics via Text Analysis in Blogs and social Media
Important tool for market research, measuring effects of ads, etc.
Methods: on-site and off-site
Sentiment analysis is the most used tool
Linking diverse resources
Clustering all articles of the same topic
Retrieve blog and social media posts related to each cluster
Analyzing opinion and sentiments in these posts
Aggregating sentiment and opinion mining outcomes
Search via semantic metadata
Dictionary and Ontology enrichment
Automatic translation
Forensics and profiling
Automatic text summarization

12. The new medical frontier: Real time wireless medical data acquisition for 21st-Century
Healthcare and data mining challenges (David Lubliner and Stephan Kudyba)
Introduction
Wireless medical devices are a growing field
The challenge is extracting meaning out of this torrent of information
Potential benefits: transform healthcare from reactive to proactive
Evolution of modern medicine: Background
Medicine is ancient and reflects our view of the world
Modern medicine began in the 1800s with anesthesia, germ theory, antiseptics, bacteriology, Bernard, and Florence Nightingale
Technology continues to advance and make results available much faster than before
Medical data standards: Ensure that all parties are using the same nomenclature
HIPAA: Established guidelines for medical data and security of that information
ICD-10 created by WHO
Data Acquisition: Medical sensors and body scanners
Sensors
Electrical (EKG, EEG)
Measures the cardiac muscle
Voltage is measured in millivolts
Pulse oximetry: To monitor oxygen saturation of a patient's hemoglobin
Medical scanners
Magnetic Resonance Imaging (MRI) vs. Computed Tomography (CT): The magnet is the largest and most expensive component
Positron Emission Tomography (PET)
- Nuclear medicine medical imaging technique
- Works by injecting a short-lived radioactive substance into the bloodstream
- Much poorer quality than a CT
Computed Tomography (CT): Digital geometry is used to generate a 3D - image
DICOM (Digital Imaging and Communications in Medicine): Format used for storing and sharing information between systems in medical imaging
Imaging Informatics: Acquisition, storage, knowledge base, modeling, and expert systems involved with the
medical imaging field
Wireless medical devices
Can contain local data storage for subsequent download or transmit
can be standalone or within a network
Bluetooth Wireless Communications: Short range and well defined security protocols
Body sensor networks (BSNs)
A series of medical devices worn by patients
Wireless Medical Device Protected Spectrum: 2360 to 2400 MHz: To prevent interference
Integrated Data Capture Modeling for Wireless Medical Devices: Data generation is evolving
Expert system utilized to evaluate medical data
A knowledge base to make queries upon
There is a trend to use "crowd wisdom", which consists of downloading big amounts of data in the hope that more data is more correct
Data mining and Big Data
With millions of devices sending real-time data, the tracking of infections could be easier
The process of finding previously unknown patterns
Genomic mapping of large data Sets
It will be possible to provide treatment in the womb
The cost has decreased from $100M to $5K and from years to hours
Future directions: Mining large data sets: NSF and NIH research initiatives
Government agencies that support the analysis of medical data
National Science Foundation (NSF)
Department of Defense (DOD)
National Institutes of Health (NIH)
Some large data sets initiatives
EarthCube: a system to share info about the planet
Defense Advanced Research Projects Agency (DARPA)
Smart Health and Well-Being program (SHB)
Other evolving mining and analytics applications in healthcare
Workflow Analytics of Provider Organizations
Metrics to measure performance
LOS (length of stay), patient satisfaction, utilization
Staffing
Patient episode cost estimation
ER throughput
Estimating services demand
Risk Stratification
Combining structured and unstructured data for Patient diagnosis and treatment outcomes
Avoid adverse drug events
Optimize diet
Consider psychological effects of treatments
