Summary of Kudyba, S. (2014). Big Data, Mining, and Analytics: Components of Strategic Decision Making. Boca Raton: CRC Press/Taylor & Francis. 305 p.
Prepared by Ariadna73
Foreword by Thomas Davenport
   Big Data gives competitive advantage
   Appeal of Big Data: combining diverse data types and formats
   Marketing creation and analytics is a big part of Big Data
   Huge advances in healthcare at lower costs
   Big Data is being combined with traditional forms of analytics
   With this book the reader will learn about the technologies needed to establish a platform for processing Big Data

1. Introduction to the Big Data era (Stephan Kudyba and Mathew Kwatinetz)
Description of Big Data
Industries that use Big Data
   Marketing
   - Elections: using it with media to campaign
   - Investment and media
   - Commerce and loyalty data
   Healthcare
   - Predict the spread of disease
   - Descriptive power and predictive pattern matching
   Real estate
   - Building energy disclosures
   - Smart meters
   Transportation (GPS)
   - Intelligent transport applications
   - Crowd-sourced crime fighting: apps to upload crime-related information
   - Pedestrian traffic patterns in retail
   Energy (metrics)
   Retail
   - Estimate sales
   - Forecast the price of online tickets
   Sensors in various products
   - Cell phones
   - Fitness devices
   - Even appliances!
Sources are numerous
   Structured
   Unstructured
   - Website links
   - Product reviews
   - Text
   - Pictures/images
   - Tweets
   - Emails
Velocities (how quickly data is being generated, communicated, and stored)
   Real time: high-velocity or fast-moving data
   Source of more descriptive variables
Building blocks to decision support
   Unless data can help decision makers, there is little value to it
Leveraging data through analytics: the most important thing to know is which questions Big Data is going to answer
Impact trend: the ability to leverage data variables that describe activities/processes
The value of data: Big Data enhances the decision-making process
   Value is found after the data is normalized, calculated, or categorized
Ethical considerations in the Big Data era: users must be aware at all times that their data is being collected
2. Information creation through analytics (Stephan Kudyba)
End result of analytics (basic concept): extract/generate information to provide a resource that enhances the decision-making process
Business Intelligence
   Query and report creation
   - Spreadsheets
   Online Analytical Processing (OLAP)
   - Provides a multidimensional view of an activity
   - Also considers the source of the data as a variable
   - Pivot tables: leverage data in a flat file to present alternative scenarios (a minimal sketch follows this chapter's notes)
   Analytics at a glance through dashboards: be careful not to make them overwhelming
   Multivariate analysis
   - Robust BI and drill-down behind dashboard views
   - Regression
   - Data mining applications
        Neural networks
        Clustering
        Segmentation and classification
        Real-time mining
   Analysis of unstructured data: text mining
   Six Sigma focus: reduce variability in operations
   Visualization
   - Simple graphics
   - Types
Real-time mining and big data
   Analysis of unstructured data and combining structured and unstructured sources
   Complex Event Processing (CEP)
   Event Stream Processing (ESP)
Data mining and the value of data
   The first question to ask is: what is in the data file?
   Why things are happening
   - Correlations
   - Fraud detection
   - Risk assessment
   - Outcomes and other areas of healthcare
   What is likely to happen
Value of data and analytics
   Efficiency
   Cost reduction
   Productivity
   Profitability
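To make the pivot-table idea above concrete, here is a minimal sketch in Python with pandas (the chapter names no tool, and the data and column names are invented for illustration): it takes a small flat file and re-presents it as an alternative, multidimensional view.

    # Minimal sketch of the pivot-table idea: leveraging a flat file to
    # present an alternative scenario (hypothetical column names).
    import pandas as pd

    sales = pd.DataFrame({
        "region":  ["East", "East", "West", "West"],
        "product": ["A", "B", "A", "B"],
        "revenue": [100, 150, 90, 120],
    })

    # Re-shape the flat file: regions as rows, products as columns.
    view = sales.pivot_table(index="region", columns="product",
                             values="revenue", aggfunc="sum")
    print(view)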
3. Big Data Analytics (BDA): architectures, implementation, methodologies, and tools (Wullianallur Raghupathi and Viju Raghupathi)
BDA is emerging as a sub-discipline of business analytics
BDA is being used to analyze data and gain insight to make informed decisions
Primary characteristics
   - Volume
   - Velocity
   - Variety
   - Veracity (added by some practitioners)
Big data analytics is changing data analytics in companies; in the future it will be widespread
Key: utilization of distributed processing and new gathering and storing tools
   Very large data sets have existed forever, but now we have better tools to store them
   Also, social media; the sources are new; this approach is unprecedented
Challenges
   Typical limitations of open source
   Lag between when the data is collected and when it is processed
   Web services delivery mechanisms need more development
   Privacy, safeguarding security, establishing standards and governance
Architectures, frameworks, and tools
   The conceptual framework is similar to other business analytics projects, with some differences:
   1. Processing: with such a big data set, the processing must be broken down across multiple systems
   2. The proliferation of open-source platforms such as Hadoop/MapReduce encourages use across multiple domains
   3. The user interfaces are entirely different
Applied conceptual architecture
   BD needs complex tools to clean the data and then apply software to its four typical applications: OLAP, reporting, queries, and data mining
Hadoop (most popular)
   - A NoSQL-style tool developed at Apache
   - Allocates portions of the data to different servers to be processed, and then integrates the results
   - Has stimulated other Apache developments, such as:
   Zookeeper: coordination services
        Open source; allows a centralized infrastructure with various services
        Uses Java and C interfaces
   HBase
        Column-oriented database management system that sits on top of HDFS (the Hadoop file system)
        NameNode (master): manages the cluster
        Slave nodes; JobTracker; TaskTracker
        Does not support SQL
        Developed in Java; supports Avro, REST, Thrift
   Cassandra and HBase: databases
        Cassandra is a distributed database system and an Apache top-level project
        Up to 2 million columns in a single row
        Built on a distributed architecture named Dynamo
        Does not have a master node, so it can be more easily queried
   Apache Avro: data serialization
        Facilitates data serialization services
        Includes the schema in the data file
        Supports versioning
   Mahout: machine-learning library
   Chukwa: monitoring system
   Streams
        Applies analytics techniques to data in motion
        Sensors, voice, video, financials
        - Advantage: graceful degradation, or the capability to cope with failures
Two important Hadoop modules
   HDFS (Hadoop Distributed File System): only a chunk of the data lives in each machine, and there can be some replication too
   Then there is the need for a technology for distributed analysis and aggregation of the results: MapReduce
        It can also be used to clean the data
        - The problem is that it is really complex and there are not a lot of experts
        - Vendors
             Open source: AWS, Cloudera, MapR, Hortonworks
             Proprietary: BigInsights (IBM)
MapReduce
   - Developed by Google
   - Algorithm components: map and reduce (a conceptual word-count sketch appears at the end of this chapter's notes)
        Map: map the broken-down tasks to the various locations
        Reduce: collect and aggregate the results
   - Mainly called from Java
   - Higher-level programming languages have been developed; here are some examples:
        Pig: high-level programming language for Hadoop
             Pig Latin is the language itself; then there is the runtime where the language is compiled and executed
             Steps of a Pig program
                  Load the data
                  Apply a series of manipulations that convert the data into mapper and reducer tasks
                  Dump the data to screen or storage
             Advantage: lets you focus more on data analysis than on programming
        Hive: provides an SQL-like language to make queries
             Has a shorter learning curve, but it is slower
             Not appropriate for writing very large programs
Prepared by Ariadna73 Page 5 of 22 Jaql It is a query language for JavaScript Object Notational,but it is more powerful Enables JOIN, GROUP and FILTER It is like an Hybrid of PIG and HIVE Oozie Streamlines the workflow and coordination among MapReduce tasks Define jobs, define relationships between the jobs Schedule execution Lucene - used for Text/Analytic searches - Full text indexing and library search for use within a Java application BDA Methodology Stage 1: Concept design Establish need. Define problem Why is this project important? Stage 2: Proposal Abstract: Overall methodology and implementation process Introduction - What problem? - Why is it important? - Why Big data approach? Completeness: Is the concept design complete? Correctness: technically sound? Consistency: cohesive or choppy? Communicability: Format?, understandable? Background material - Problem domain discussion - Prior projects and research Stage 3: Methodology Hypothesis development Data sources and collection Variable selection ETL and data transformation Platform Analytic techniques Expected results Policy implications Scope and limitations Future research Implementation - Develop conceptual architecture Show and describe component Show and describe big data analytics tools - Execute steps in methodology - Import data - Perform analytics.the various tools Word Count Association Classification Clustering - Gain insight from outputs - Draw conclusions - Derive policy implications Summary of Kudyba (2014) Big data, mining, and analytics : components of strategic decision making; Boca Raton : CRC Press/Taylor & Francis; 305p
        - Make informed decisions
   Stage 4: Presentation and evaluation
        Robustness
        Variety of insight
        Substantiveness of the research question
        Demonstration of the big data analytics application
        Some degree of integration among components
        Sophistication and complexity of the analysis
Examples
   BDA in healthcare: two broad applications
        Healthcare business and delivery side
        - Veterans Administration examples
        Healthcare information technology (HIT)
             Electronic medical records (EMR)
             - Improve quality and lower costs
        Great potential in the practice of medicine
        - Evidence-based medicine
             Kaiser Permanente in California helped retire a bad drug (Vioxx)
             The National Institute of Health in the UK used big data to investigate drugs
        - Personalized medicine: aid in diagnosis and treatment
        Reduction in medical errors
        Outcomes
   BDA of cancer blogs: a project to extract data from blogs
        - Objectives
             To use Hadoop and MapReduce
             To develop a paving algorithm
             To develop a vocabulary and taxonomy of keywords
             To build a prototype interface
             To contribute to social media analysis
        - Types of unstructured information
             Blog topic
             Disease treatment
             Other information
        What can we learn from blog postings?
        - What are the most common issues?
        - Which cancer types are discussed most?
        - Therapies and treatments
        - Which blogs and bloggers are relevant and correct?
        - Major motivators for comments
        - Emerging trends in symptoms, treatment, therapy
        What are the phases and milestones?
        - Phase 1: Collection of blog postings into a Derby application
        - Phase 2: Configuration of the architecture (keywords, associations, correlations, clusters, taxonomy)
        - Phase 3: Analysis of the extracted information (for example, identify patterns)
        - Phase 4: Development of the taxonomy
        - Phase 5: Testing the mining model and developing a user interface for deployment
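The map and reduce phases described in the tooling notes above (and the "word count" analytic listed in Stage 3) can be illustrated with a tiny, single-machine Python sketch; this only mimics the concept, not Hadoop's actual distributed Java API.

    # Conceptual word-count sketch of the map and reduce phases;
    # plain Python standing in for Hadoop's distributed runtime.
    from collections import defaultdict
    from itertools import chain

    documents = ["big data mining", "data mining and analytics", "big data"]

    # Map phase: emit (word, 1) pairs from each document independently.
    def map_phase(doc):
        return [(word, 1) for word in doc.split()]

    mapped = chain.from_iterable(map_phase(d) for d in documents)

    # Shuffle/reduce phase: collect and aggregate the counts per word.
    counts = defaultdict(int)
    for word, n in mapped:
        counts[word] += n

    print(dict(counts))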
4. Data mining methods and the rise of big data (Wayne Thompson)
Big Data: exponential growth, availability, and use of information
The goal of data mining is generalization
Example application = telematics: transferring data from anything like a machine, a vehicle, an appliance, etc.
Data mining methods
   Classical data mining techniques
        Statistical methods (not strictly data mining)
        - Propose a model that may explain the relationship between an outcome of interest (dependent variable) and explanatory (independent) variables
        - Multiple linear regression
             Used for prediction
             Supervised learning technique: identifies whether a set of input variables is useful for prediction
        - Logistic regression
             A form of regression analysis in which the target (response) variable is categorical
        K-means clustering
        - Methodology for modeling (a minimal sketch follows these notes)
             1. Select K observations
             2. Assign each observation to the cluster with the nearest mean
             3. Recalculate the positions of the centroids
             4. Repeat 2 and 3 until the centroids no longer change
        - Finds K partitions in the data in which each observation belongs to the cluster with the nearest mean
        Association analysis
        - Identifies groups of products or services that tend to be purchased at the same time, or at different times, by the same customer
        - Phase of data mining = descriptive modeling
        Decision trees
        - Segmentation of the data created by applying a series of simple rules
        - Sometimes the trees become very large
   Machine learning: focus on automation
        Neural networks
        - Mimic the human brain
        - Very complex
        - Black boxes
        Support vector machines
        - Find the hyperplane that best splits the target values
        - Good for classifying simple and complex models
        Ensemble models
        - Use multiple modeling methods, such as neural networks or decision trees, to obtain separate models from the same training data set
   Model comparison
   - Important: use separate data sets for training, validation, and testing
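A minimal sketch of the K-means steps listed above, using NumPy on synthetic data (an illustration of the algorithm, not the chapter's own implementation):

    # K-means following the four steps above (numpy only, synthetic data).
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    X[100:] += 4                      # two loosely separated groups
    k = 2

    # 1. Select K observations as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(100):
        # 2. Assign each observation to the cluster with the nearest mean.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recalculate the positions of the centroids.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Repeat 2 and 3 until the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    print(centroids)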
Text analytics
   Knowledge engineering = writing rules based on syntax and grammar to extract the information
   - Accessing the documents: TXT, PDF, CSV, etc.
   Learning from the collection: text mining
   - From text to data set
        Use parts of speech
        Result: a quantitative representation of the text
        Dimension reduction with singular value decomposition: to shrink and simplify the very large data set (see the sketch after these notes)
   - Exploratory models
        Clustering: separate the documents into mutually exclusive sets
        Topic modeling
             Allows documents to be associated with many topics
             Other approach: assume that the topics were generated by a process, and try to find that process's rules
   - Predictive models
        Once the text has been transformed into a data set, any approach can be taken
        Also, text-specific methodologies can be applied, such as inductive rule learning algorithms
   Digging deeper with natural language: knowledge engineering
   - Used to extract important information from the documents
   Combining approaches is the best strategy: transform the text into a data set and also extract information
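As a rough illustration of "from text to data set" plus dimension reduction with singular value decomposition, here is a sketch using scikit-learn (an assumed tool; the chapter does not prescribe a library, and the documents are invented):

    # Text to quantitative representation, then SVD to shrink the matrix.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    docs = [
        "patient reports mild symptoms after treatment",
        "new treatment reduces symptoms in most patients",
        "stock prices rise after earnings report",
    ]

    # Quantitative representation of the text (term-document matrix).
    tfidf = TfidfVectorizer(stop_words="english")
    X = tfidf.fit_transform(docs)

    # Singular value decomposition to simplify the large, sparse matrix.
    svd = TruncatedSVD(n_components=2)
    X_reduced = svd.fit_transform(X)
    print(X_reduced.shape)   # (3 documents, 2 latent dimensions)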
5. Data management and the model creation process of structured data for mining and analytics (Stephan Kudyba)
Things to do before mining: essential steps in managing data resources for mining and analytics (think first and analyze later)
   Problem definition
        Identifies the type of analytic project to be undertaken
        - OLAP or basic dashboard
        - Statistical/investigative analysis
        - Multivariable data mining
        Identifies data sufficiency issues
        - Does the essential data exist in the required amounts and formats to provide a solution?
   Identify your performance metric
        - Attain greater understanding of some activity at an organization
        - The business problem will be a performance metric
        - The key and most challenging part is to select the target variable
   Formatting your data to match your analytic focus (this can take up to 80% of the time; a small sketch follows this list)
        Acquire and analyze your data source
        - Amounts of data
             Is there enough data? When working with categorical data, each category actually adds another driver variable to the model.
             Alleviate the issue of too much data
                  Take random samples
                  Use technologies that handle big volumes of data
                  Define the problem
        - Cleansing your data of errors
             Use software to visualize outliers
        - Transformation of variables to provide more robust analysis
             Make the variable relative, for example transform values into percentage of market share
             The transformation usually comes after the near-final data resource is ready to be mined
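A small pandas sketch of two of the formatting steps above: making a variable relative (share of the market) and letting each category add a driver variable. The column names and numbers are hypothetical.

    # Formatting steps: relative variable plus categorical expansion.
    import pandas as pd

    df = pd.DataFrame({
        "region":  ["East", "West", "East", "West"],
        "channel": ["web", "web", "store", "store"],
        "sales":   [120.0, 80.0, 60.0, 140.0],
    })

    # Transform raw sales into percentage of the total market per region.
    df["market_share"] = df["sales"] / df.groupby("region")["sales"].transform("sum")

    # Each category becomes an additional driver variable in the model.
    df = pd.get_dummies(df, columns=["channel"])
    print(df)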
As sophisticated as mining methods may seem, they remain techniques that depend on their sources, so we need a methodology for managing data resources
   - Avoid comparing apples to oranges. The data must be:
        Relevant
        Available
        Accurate
        In a format that can be mined
   - Factors must be deliberated with analysts, data experts, and clients
   - Qualify what kind of project this is
Selecting driver variables that provide explanatory information
   Selection of variables that can impact the target or performance metric
   - With the emergence of new big data variables comes the complexity of identifying the data sources and the strategies for merging them with existing data
   - Example of big data influence: GPS information can help insurers predict areas with more accidents
   After identifying the target variable, define the level of detail of the data to be analyzed
   Also consider the way the analysis will be structured
It's time to mine
   Use available statistical metrics to interpret the validity and output of the analysis
   Keep in mind working with the results rather than massaging the data
Interpreting the results (an input for decision making)
   Remember that mining is done to provide actionable information for decision makers
   Do the model results produce reliable patterns? Do the results make sense?
   Adjust the mining as needed
Communicating the results to achieve true decision support
   Effective transfer of information happens when the receiver of the resource fully comprehends its meaning
The process is not finished yet (monitor the results in the real world)
   Business Intelligence to the rescue
        Create OLAP cubes to slice the data
        Keep ongoing information about the key performance indicators
        Search for poor modeling techniques
Additional methods to achieve quality models
   Advertising effectiveness and product type
   - Example of a solution: normalize the data, add per-capita indicators
   - Issue: measuring the effects over time
Summing up the factors for success (not just analysis)
   - Incorporate a subject matter expert and a data professional
6. The Internet: a source of new data for mining in marketing (Robert Young)
Marketing landscape and the Internet's position
   Share of ad spend / share of time
   The Internet is the second most widely used marketing channel
   By 2015, TV will no longer be #1
   70% of Americans use the Internet weekly
   (Diagram: paid, owned, earned, and shared media)
   Increases in paid media impressions result in increased traffic to owned channels
   Labor costs decrease
A new language for measuring is emerging
   Returns and rates
   Machine metrics combined with offline research can provide accurate models of consumption
Media mix dynamics
   Click ROI: 5 machine metrics
        1. Search query volume
        2. Ad server count
        3. Capturing the clicking as part of the website analytics
        4. Procrastinating clickers: those who visit the site days after being exposed to the ad
        5. Attribution modeling: uses algorithms to track the online history
   Nonclick ROI (media modeling)
        Also called market mix modeling
        Applies statistical analysis to historical time series data to unearth relationships between driver variables (ads) and target variables (sales); a minimal regression sketch follows this chapter's notes
Media modeling and the Internet's new role
   Internet response functions
   Online channels are more popular but less effective over time
   Sometimes mixing channels, such as Internet + TV, is better than each channel separately
The Internet as a source of open data
   Google Insights and Google Trends are tools to analyze the volume of searches
   List of websites to analyze the Internet
        Technorati: to analyze blogs
        Alexa: rankings
        thedatahub.org
        getthedata.org
Online's direct and indirect drive
   The difficulty now is mixing online and offline channels
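A minimal sketch of the nonclick-ROI (market mix modeling) idea: ordinary least squares on synthetic weekly data, regressing sales on ad spend by channel. The numbers and channel names are invented for illustration.

    # Market mix modeling in miniature: sales ~ intercept + tv + online.
    import numpy as np

    weeks = 52
    rng = np.random.default_rng(1)
    tv = rng.uniform(0, 100, weeks)        # weekly TV spend (synthetic)
    online = rng.uniform(0, 100, weeks)    # weekly online spend (synthetic)
    sales = 50 + 1.8 * tv + 0.9 * online + rng.normal(0, 10, weeks)

    # Ordinary least squares on the historical time series.
    X = np.column_stack([np.ones(weeks), tv, online])
    coef, *_ = np.linalg.lstsq(X, sales, rcond=None)
    print("intercept, tv effect, online effect:", coef.round(2))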
7. Mining and analytics in e-commerce (Stephan Kudyba)
The evolving world of e-commerce
   E-commerce is wider than web activities
        Product "likes" in social media
        Developing apps
        Maintaining active Facebook accounts
        Sending texts
   The analyst must decide what level of analytics is necessary
   Basic email reports (a small metrics sketch follows this chapter's notes)
        Number of emails sent
        Emails delivered
        Number of emails bounced
   Analyzing web metrics
        Page views
        Time spent on page
        Navigation route within a website
        Click rates
   Report generation
        Interactive cubes are great
        Google Analytics comes in handy
From retrospective analytics to more prospective analytics with data mining
   Defining e-commerce models for mining
        The model can include offline strategies
   Better understanding e-mail tactics
        Try not to become another spammer
        Analyze different variables to determine their effectiveness
   Data resources from social media
        The "likes" are a new metric
   Mobile apps: an organization's dream for customer relationships
        Apps are growing exponentially
        Mining helps determine the effectiveness of an app
        Movie genre, movie, and online endeavor
Factors to consider in web mining
   Optimize the front-end web experience
   Beginning the mining process
        Analytics must look for repeatable patterns
   Evolution in web marketing
        Mining provides insights on identifying visitors
The customer experience the future entails: knowing your customer better
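A tiny sketch of the basic e-mail report metrics listed above, with hypothetical counts; the formulas are the usual rate definitions, not anything prescribed by the chapter.

    # Basic e-mail campaign report from raw counts (hypothetical numbers).
    sent, bounced, opens, clicks = 10_000, 400, 2_300, 150

    delivered = sent - bounced
    bounce_rate = bounced / sent
    open_rate = opens / delivered
    click_through_rate = clicks / delivered

    print(f"delivered={delivered}, bounce rate={bounce_rate:.1%}, "
          f"open rate={open_rate:.1%}, CTR={click_through_rate:.1%}")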
8. Streaming data in the age of Big Data (Billie Anderson and Michael Hardin)
Introduction
   Data collection is everywhere
        Cell phones
        Internet
        Social media
   The capacity to make decisions with immediacy is key to the operation of the most successful companies
        Amazon is a good example of retail success
        Other areas and non-retail applications can be:
        - Finding security breaches
        - Disease outbreaks
        - Service warnings
        - Detecting credit fraud
   Big data comes in multiple types
        - Traditional: customer transactional information
        - Computer-generated (weblogs, website navigation)
        - Machine-generated data
        - Social media data
        - Multimedia data (video, voice, etc.)
   Big data cannot be stored and analyzed in the traditional way
        - An analyst must be able to explore all the types of raw, unstructured data
        - This chapter explains why streaming data needs "special" treatment
        - Examples given: healthcare, marketing
        - Examination of possible data applications for streaming
Streaming data
   The demand for analyzing streaming data comes from business decision makers and from people who want to improve lives
        For example, a study in Toronto uses big data streaming to detect infections 24 hours sooner than in the past (Zikopoulos, P., Eaton, C., deRoos, D., Deutsch, T., and Lapis, G. 2012. Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. New York: McGraw-Hill)
   In the past, enterprise data was stored in data warehouse (DW) systems
        Problem: with the velocity of modern data, traditional analysis approaches are no longer appropriate
        DW is more expensive because it uses disk space to store the data
   In a streaming data application, the data flows quickly in the form of text, video, or values
        Data can come in trillions of bytes per second
        Difference from DW: data is kept in memory only for as long as processing the data is needed
   (Diagrams: differences between DW and DSMS)
   Professionals required: statistician, domain expert, computer analyst
   Example of a methodology for developing a real-time analysis system: Los Alamos National Laboratory radio astronomy
        - Goal: mitigation of noise, or Radio Frequency Interference (RFI = man-made radio signals)
             Interesting because the telescope can then concentrate on really interesting radio signals and save a lot of space (and money) by ignoring the man-made signals
        - The computational details are only a tiny part of the overall system; network bandwidth and storage are key
Streaming data case studies
   Healthcare: influenza surveillance tracking methods
        Traditional approach
        - Patient admitted
        - Patient monitored while in the hospital
        More contemporary approach
        - Links Google searches to healthcare officials from the Centers for Disease Control, using Internet searches to monitor what part of the population has influenza
        - A group of researchers in Hong Kong has developed a digital dashboard to monitor influenza outbreaks
   Marketing: Pay Per Click (PPC)
        Pay only if the advertisement is clicked
        - First created by GoTo.com
        - Then transformed into Overture
        - Then purchased by Yahoo
        - Google has its own
        The ads bid for which one will appear at the top
        If an ad with a low bid gets more clicks, it gets a higher rank
        Detecting click fraud is crucial
        - How to determine whether a click is fake or not?
        - With big data the time window can be analyzed more precisely: an algorithm called the "Bloom filter" determines whether a click is a duplicate over a jumping window of time (a conceptual sketch appears at the end of this chapter's notes)
   Credit card fraud detection
        Costs: $8.6 billion annually
        Method to detect fraud: clustering
        - Unsupervised data mining technique: there is no outcome or target variable that is modeled using predictor variables
        - Isolates transactions into homogeneous groups
        - Then keeps finding the centers of the groups in order to identify the outliers
        - Tasoulis has created an algorithm to generate clusters from streaming data
             Main idea: cluster transactions in moving time windows
             Includes a "forgetting factor" to disregard data when it is no longer useful
             Two great potentials
                  Uncover fraud as soon as it occurs
                  Create a data stream analysis that is efficient
   Streaming data is like a river
        Data flows in and data flows out
        The analyst only gets to see the data one time
Vendor products for streaming big data
   Streaming data types are HIGHLY UNSTRUCTURED
        Text logs
        Twitter feeds
        Transactional flows
        Click streams
   SAS
        Offers end-to-end business solutions to 15 specific industries
        - Casinos
        - Education
        - Healthcare
        - Hotels
        - Insurance
        - Life sciences
        - Manufacturing
        - Retail
        - Utilities
        December 2012: SAS launched the SAS DataFlux Event Stream Processing Engine
        - Incorporates relational, procedural, and pattern-matching analysis of unstructured and structured data
        - Ideal for analyzing high-velocity big data in real time
        - Targeted at the financial field
             Global banks must comply with Basel III, Dodd-Frank, and other regulations
             Includes data streamed from Thomson Reuters, NYSE Technologies, and International Data Corp
   IBM
        Currently on its third version of a streaming data platform, InfoSphere Streams
        - High-performance computing platform
        - Allows developing and reusing applications to rapidly consume and analyze information, enabling users to:
             Continuously analyze petabytes per day
             Perform complex analytics on heterogeneous data types
                  Text, images, audio, voice, video, web traffic, e-mail, GPS, financial transaction data, satellite data, sensors
             Leverage sub-millisecond latencies to react to unfolding events
             Adapt to rapidly changing data forms and types
             Easily visualize
Conclusion
   Streaming big data adds a new component to data security
   The number of individuals with access to streaming data is another concern
   This is a rich area for ongoing research
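Returning to the click-fraud discussion above: here is a conceptual, pure-Python sketch of using a Bloom filter to flag probable duplicate clicks within a jumping time window. It illustrates the general technique only; the chapter's actual algorithm details are not reproduced, and the sizes and click IDs are invented.

    # Bloom filter used to flag probable duplicate clicks in one window.
    import hashlib

    class BloomFilter:
        def __init__(self, size=10_000, hashes=3):
            self.size, self.hashes = size, hashes
            self.bits = [False] * size

        def _positions(self, item):
            # Derive several bit positions from independent hash inputs.
            for i in range(self.hashes):
                digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
                yield int(digest, 16) % self.size

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos] = True

        def probably_seen(self, item):
            return all(self.bits[pos] for pos in self._positions(item))

    window = BloomFilter()
    for click_id in ["u1-ad7", "u2-ad7", "u1-ad7"]:   # third click repeats the first
        if window.probably_seen(click_id):
            print("possible duplicate click:", click_id)
        else:
            window.add(click_id)
    # At each window boundary the filter would be reset ("jumping" window).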
9. Using CEP for real-time data mining (Steven Barber)
Introduction: Complex Event Processing (CEP)
   New category of software
   Prioritizes a stream of events, performs computations, and produces a stream of new events based on the results of the computations
   CEP squeezes out what is interesting from a raw event, produces a new event, and discards the garbage
   It is also used with static data, mostly for testing
   Often used to match already known patterns against arriving data and then act on the match
   It is similar to other classic programming, except that it takes in very large numbers of events that are relatively raw
Quantitative approaches to streaming data analysis
   The model is declarative rather than imperative
   Data-flow oriented rather than control-flow oriented
   Push oriented rather than pull oriented
   It is the arrival of an event that triggers the CEP processing
   CEP systems are very responsive, with minimal processing latency
   Rather than loops and if-then-else, the operators are event- and stream-oriented: aggregate, filter, map, and split
   Rather than global variables, state is kept in relational tables and retrieved via streaming joins of arriving events with the contents of the table
   Streams are partitioned into windows over which operations are performed
   Higher level of abstraction; does not deal with lower-level details
Event stream processing: evolution in recent history
   There is always more data and it isn't getting any simpler
   Recognizing which event is interesting and which is not is something that MUST be automated
   Sometimes we have millions of events per second and must respond to them immediately
Processing model evolution
   The orthodox history of CEP: a long-needed switch from RDBMS-based real-time processing to event stream processing
   - CEP has grown out of the relational database world
   - The resulting code tends to look like SQL
   - RDBMS: store, then analyze
        That has limitations when data arrives at millions of messages per second
        But not every event needs to be stored by every process that receives it
        Perhaps we don't even need to remember the event after using it
   - CEP analyzes and then stores (if needed), or analyzes while storing (inbound processing model)
   - Turning the database on its side
        In RDBMSs, data stays and queries are transient
        In CEP, data is transient and queries are persistent
   Another view: CEP is a message transformation engine, a service attached to a distributed messaging bus
   - Imagine CEP as stream-oriented asynchronous computation attached to a distributed messaging bus
   - Events go in, get operated upon, and events go out
   - CEP is analogous to receiving a callback in an API
   The wheel turns again: Big Data, web applications, and log file analysis
   - The problem with log files is that they are slow to create and store
   - Store (in log files), then analyze (with, for example, Hadoop)
        There is another tool called Flume
        But this approach is being replaced with newer systems such as Twitter Storm and Apache Kafka
   Rise of inexpensive multicore processors and GPU-based processing
Advantages of CEP
   Visual programming for communication with stakeholders
        Visualize CEP as program flow diagrams
        Emphasizes that events come in streams
   Benefits
   - The developer is able to plot out the flow of each event and keep the flow of each component visible
   - The meaning of a visual flow-based presentation is clearer to stakeholders
   Flip side: potential for creating very confusing diagrams
CEP example: a simple application using StreamBase EventFlow
   Concepts and terminology
        This is the language used by visual programmers
        It is entirely visual, no text, with arrows and boxes
        Event = a discrete representation that something happened
        - In StreamBase the events are represented in a data type called "tuple"
        - Individual tuples are instances of a schema
        - A tuple contains a single value for each field in its schema
        Schema: has a name and a set of schema fields
        - Schema field = name + data type
        - For example, a field called quantity has a type of INT; a tuple would have a field called quantity and a value of 100
        An EventFlow application file contains a set of components and arcs
        - Input streams
        - Output streams
        - Operators
             Perform a specific type of runtime action on the tuples
             Operators have editable properties
        Stream = a potentially infinite set of tuples
        Sequence = strict ordering of events in the stream
        - Input streams
        - Output streams
        - Arcs
   Step by step through the MarketFeed Monitor app
        Listens to a stream of securities market data, calculates throughput statistics, and publishes them to a statistics stream
        Also detects drastic fall-offs and sends alerts (a toy sketch of this idea follows the list of trading uses below)
        The input data is simple and the output data is more complex
Uses of CEP in industry and applications
   Automated securities trading
        Market data management
        - NYSE: data structure and normalization vendor
        - OneTick: tick vendor
        Signal generation / alpha-seeking algorithms
        - Low and high level
        - Needed to better follow the market
        Execution algorithms
        - Algorithm manager
        - Order manager
        - Risk and compliance manager
        - Audit system
        - Connection to upstream and downstream orders
        Smart order routing
        - Widely implemented
        - The most common asset routed is equities
        - Analyze real-time market conditions
        Real-time profit and loss
        Transaction cost analysis
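A toy sketch of the MarketFeed Monitor idea referenced above: compute throughput per time window over an ordered stream of tick timestamps and alert on a drastic fall-off. Plain Python stands in for StreamBase EventFlow (which is a visual language), and the window size and fall-off threshold are assumptions.

    # Windowed throughput statistics with a fall-off alert (conceptual only).
    WINDOW = 5            # seconds per window (assumed)
    FALLOFF = 0.5         # alert if throughput drops below half the prior window

    def monitor(tick_timestamps):
        window_start, count, previous = None, 0, None
        for ts in tick_timestamps:                 # events arrive in order
            if window_start is None:
                window_start = ts
            while ts - window_start >= WINDOW:     # close finished windows
                rate = count / WINDOW
                yield ("stats", window_start, rate)
                if previous and rate < previous * FALLOFF:
                    yield ("ALERT: drastic fall-off", window_start, rate)
                previous, count = rate, 0
                window_start += WINDOW
            count += 1

    for event in monitor([0.1, 0.4, 0.9, 1.2, 2.0, 3.3, 4.8, 9.7, 11.0]):
        print(event)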
CEP in other industries
   Intelligence and security
   - Sense and respond to patterns that indicate threats
   - Details on how the government is leveraging this
        Intelligence and surveillance
        Intrusion detection and network monitoring
        Battlefield command and control
   Multiplayer online gaming
   - Interest management
   - Real-time metrics
   - Managing user-generated content
   - Website monitoring
   Retail and e-commerce transaction analysis
   - In-store promotion
   - Website monitoring
   - Fraud detection
   Network and software systems monitoring
   Bandwidth and quality-of-service monitoring
Effects of CEP
   Decision making
   - Proactive rather than reactive, shorter event time frames, immediate feedback
   - The key is to be able to search and present the data as it arrives
   Strategy
   - From observing and discovering patterns to automating actions based on patterns
   - Datawatch's Panopticon is a tool to visualize patterns
   Operational processes
   - Reduced time to market due to productivity increases
   - Shortened development means the ability to try out more new ideas
Summary
   CEP provides real-time analytics for streaming data
   CEP architectures typically do not keep the data
   StreamBase is a trademark of TIBCO

10. Transforming unstructured data into useful information (Meta S. Brown)
Introduction: unstructured data
   "Unstructured" data is not really unstructured: it is just not presented in the simpler form the analyst is used to
   The dividing line between nature and data is the computer
Text analytics in a big data culture
   Driving forces for text analytics
        The majority of investors in text analytics have not seen any ROI yet
        Application for text analytics: analysis of open-ended questions
   Difficulties in conducting valuable text analytics
        Critics point out that the results are not perfect
        Deriving meaning from language is not simple and often requires knowing the context
The goal of text analysis: deriving structure from unstructured data
   There are different definitions of the term "text analytics":
        The process of analyzing unstructured text, extracting relevant information, and then transforming that information into structured information that can be leveraged in several ways
        Analysis of data contained in natural language text
        Application of text mining techniques to solve business problems
        The process of deriving information from text sources
        Conversion of unstructured text into structured data
The transformation process in concept: two major classes of text analytic methods
   Statistical: grounded in mathematics
   Linguistic: uses the rules of a specific language
What can go wrong? No ROI! Imperfect results
Techniques and methodologies
   - Entity extraction
        Identification and labeling of specific structures such as names, places, or dates
        The text is then surrounded by tags, similar to HTML tags, to identify the pieces that are interesting
   - Autoclassification
        Mechanism used primarily to facilitate identification and retrieval of documents
        Uses a hierarchy or taxonomy of topics
        Specific type: sentiment analysis
             The categories reflect varying attitudes
             Frequently applied to social media
             Very inexact and frustrating
   - Search
        Automated process for on-demand retrieval of documents that are relevant to topics of interest
        Documents are quickly indexed
        Contents of scanned documents are organized into databases
Integrating unstructured data into predictive analytics: the most important part is to clarify the questions that will be asked of the text
   Assessing the value of new information
        What to do after the unstructured text has been transformed into structured data?
        Use categorical and continuous variables for predictive modeling
        New variables are created for use in modeling
        The only new techniques needed are those that translate text into data structures
   Using information to direct action
        The information means something if and only if it is used
        Profitable text analytics begins with a clear business case and well-defined, reasonable models
   Examples of applications
   - Application: recommendation engines
        Inputs: past transactions and demographics
        For example: recommendations received when shopping online
   - Application: churn modeling
        Used to identify customers at high risk of going to competitors
        Try to identify unhappy customers using sentiment analysis
   - Application: chat monitoring
        Surveillance for profanity or inappropriate activity
        Alternative: live monitors, but those are really expensive
Summary
   Text analytics is a young and imperfect field
   It is the hottest area of development in analytics and will remain active for decades to come

11. Mining big textual data (Ioannis Korkontzelos)
Introduction
   The common bottleneck among all big data analysis methods is structuring
   Most of the data available are not ready for direct processing, because they are associated with limited or no metadata
        Metadata: information about the data
   This article deals with textual data
        Sources
        Overview of methods for structuring data
        Examples and applications
Sources of unstructured textual data
   Textual data sources: aspects of classification (properties orthogonal to each other)
   Domain: represents the degree to which specialized and technical vocabulary is used in the text
   - The senses of words depend on it
   - A piece of text can belong to a specialized technical or scientific domain, or to the general domain
   Language
   - One of the few hard, general classification features of text
   - Partitions a collection of items into non-overlapping item sets
   Style
   - From formal and scientific to colloquial and abbreviated
   - Due to social media, a new style of elliptical, very condensed text has emerged
   Patents
        Agreements between a government and the creator of an invention
        Steps to structure a collection of patents
        - Identify the parts of the documents that correspond to different aspects
        - Look for the different domains in the text
        - Recognize and separate tables and figures
   Publications
        Easy to match with any domain and application
        Natural level of universal structuring
        - Title
        - Abstract
        - Sections (methods, results, conclusions)
        - Contain keywords
   Corporate web pages
        Difficult to mine
        But companies are still interested in mining the competition
   Blogs
        Growing interest in analyzing them
        Easier to access
   Social media
        Challenges: determining the domain
        Identifying the style is difficult
        Impractical due to immense size
   News
        Easy to process
        Carefully written
        Domains easily specified
   Online forums and discussion groups
        Strictly domain-specific text
        Come with their own search tools
   Technical specification documents
        Lengthy, domain-specific documents
        Formal style
        Excellent source of terminology
   Newsgroups, mailing lists, emails
        Supported by a specialized Network News Transfer Protocol (NNTP)
        Domain-specific contents
        Emails are a valuable source of text
   Legal documentation
        Minutes of public meetings are digitalized and indexed
        Important for text processing as a large, domain-specific textual source for tasks such as:
        - Topic recognition
        - Extracting legal terminology
        If it comes with a parallel translation, it is an excellent source of multilingual data
   Wikipedia
        Covers many technical domains
        Has a search engine
        Anyone can modify the contents, and that's not always good
Structuring textual data
   Term recognition
        Words or sequences of words that represent concepts of a specific domain
        Useful for many things
        - Recognizing neologisms
        - Indexing using terms improves search performance
        Approaches
        - Linguistic: morphological, grammatical, syntactical
        - Dictionary-based: employ already available repositories
        - Statistical: application of various statistical tools using raw counts, frequencies, co-occurrences
        - Hybrid
             Example: combination of part-of-speech filtering followed by regular expressions applied to the extracted terms
   Named entity recognition
        Terms associated with a specific class of concepts
        Very similar to the notion of terms, but named entities must be classified and terms need not be
        Uses an ontology as background knowledge: domain-specific classifications of concepts into classes
   Relation extraction
        Relations between previously recognized named entities, and methods for recognizing them
        Is-a relation: "car is-a vehicle"
        Interacts-with
   Event extraction
        Events are complex interactions of named entities
        Different types and nature for different textual domains
        - In the domain of "news", events are incidents or happenings
        - In the domain of "biology", events are structured descriptions of biological processes
        Can follow simpler or more sophisticated methods
        - Simpler: recognize trigger words
        - More sophisticated: introduce iterations
   Sentiment analysis
        The task of assigning scores to textual pieces that represent the attitude of the writer (a minimal lexicon-based sketch follows this chapter's notes)
        Can consider multiple scopes of the sentiment dimension
Applications
   Web analytics via text analysis in blogs and social media
        Important tool for market research, measuring the effects of ads, etc.
        Methods: on-site and off-site
        Sentiment analysis is the most used tool
        Linking diverse resources
             Clustering all articles on the same topic
             Retrieving blog and social media posts related to each cluster
             Analyzing opinion and sentiment in these posts
             Aggregating sentiment and opinion mining outcomes
   Search via semantic metadata
   Dictionary and ontology enrichment
   Automatic translation
   Forensics and profiling
   Automatic text summarization
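A very small lexicon-based sketch of sentiment scoring over posts, which also shows why such results are inexact (bare word lists, no context). The word lists and posts are invented; nothing here reproduces the chapter's own method.

    # Lexicon-based sentiment scoring: count positive and negative words.
    POSITIVE = {"good", "great", "effective", "recommend"}
    NEGATIVE = {"bad", "painful", "worse", "avoid"}

    def sentiment(text):
        words = text.lower().split()
        score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
        return "positive" if score > 0 else "negative" if score < 0 else "neutral"

    posts = [
        "The new treatment was effective and I feel great",
        "Side effects were painful, avoid this therapy",
        "Started a new therapy this week",
    ]
    for p in posts:
        print(sentiment(p), "-", p)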
12. The new medical frontier: real-time wireless medical data acquisition for 21st-century healthcare and data mining challenges (David Lubliner and Stephan Kudyba)
Introduction
   Wireless medical devices are a growing field
   The challenge is extracting meaning out of this torrent of information
   Potential benefit: transform healthcare from reactive to proactive
Evolution of modern medicine: background
   Medicine is ancient and reflects our view of the world
   Modern medicine began in the 1800s with anesthesia, germ theory, antiseptics, bacteriology, Bernard, and Florence Nightingale
   Technology continues to advance and makes results available much faster than before
Medical data standards: ensure that all parties are using the same nomenclature
   HIPAA: established guidelines for medical data and the security of that information
   ICD-10: created by the WHO
Data acquisition: medical sensors and body scanners
   Sensors
        Electrical (EKG, EEG): measures the cardiac muscle; voltage is measured in millivolts
        Pulse oximetry: to monitor the oxygen saturation of a patient's hemoglobin
   Medical scanners
        Magnetic Resonance Imaging (MRI): the magnet is the largest and most expensive component
        Ultrasound (US)
        Positron Emission Tomography (PET)
        - Nuclear medicine medical imaging technique
        - Works by injecting a short-lived radioactive substance into the bloodstream
        - Much poorer image quality than a CT
        Computed Tomography (CT): digital geometry is used to generate a 3D image
   DICOM (digital imaging): format used for storing and sharing information between systems in medical imaging
   Imaging informatics: acquisition, storage, knowledge base, modeling, and expert systems involved with the medical imaging field
Wireless medical devices
   Can contain local data storage for subsequent download or transmission; can be standalone or within a network
   Bluetooth wireless communications: short range and well-defined security protocols
   Body sensor networks (BSNs): a series of medical devices worn by patients
   Wireless Medical Device Protected Spectrum: 2360 to 2400 MHz, to prevent interference
Integrated data capture modeling for wireless medical devices: data generation is evolving
   Expert system utilized to evaluate medical data (a hypothetical rule-based sketch follows this chapter's notes)
   A knowledge base to make queries upon
   There is a trend toward using "crowd wisdom", which consists of downloading big amounts of data in the hope that more data is more correct
Data mining and Big Data
   With millions of devices sending real-time data, the tracking of infections could be easier
   The process of finding previously unknown patterns
   Genomic mapping of large data sets
        It will be possible to provide treatment in the womb
        The cost has decreased from $100M to $5K and from years to hours
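A hypothetical sketch of the expert-system idea applied to streaming pulse-oximetry readings: a single simple rule flags low oxygen saturation. The threshold and readings are illustrative only, not clinical guidance, and nothing here comes from the chapter itself.

    # Rule-based check on a stream of (time, SpO2) readings (illustrative).
    READINGS = [(0, 98), (10, 97), (20, 91), (30, 88)]   # seconds, SpO2 %

    LOW_SPO2 = 90   # assumed alert threshold, not a clinical recommendation

    def evaluate(stream):
        for t, spo2 in stream:
            if spo2 < LOW_SPO2:
                yield f"t={t}s: ALERT, oxygen saturation {spo2}% below {LOW_SPO2}%"
            else:
                yield f"t={t}s: SpO2 {spo2}% within range"

    for message in evaluate(READINGS):
        print(message)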
Future directions: mining large data sets: NSF and NIH research initiatives
   Government agencies that support the analysis of medical data
        National Science Foundation (NSF)
        Department of Defense (DOD)
        National Institutes of Health (NIH)
   Some large data set initiatives
        EarthCube: a system to share information about the planet
        Defense Advanced Research Projects Agency (DARPA)
        Smart Health and Well-Being program (SHB)
Other evolving mining and analytics applications in healthcare
   Workflow analytics of provider organizations
        Metrics to measure performance: LOS, patient satisfaction, utilization
        Staffing
        Patient episode cost estimation
        ER throughput
        Estimating demand for services
        Risk stratification
   Combining structured and unstructured data for patient diagnosis and treatment outcomes
        Avoid adverse drug events
        Optimize diet
        Consider the psychological effects of treatments