Presented to: Meritorious Professor Dr. Aqil Burni, Head of Actuarial Sciences, Institute of Business Management
Q#1: Big Data
Adnan Alam Khan (Std_18090)
What is Big Data?
Big Data Analysis
Definition: Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process within a tolerable elapsed time. [Wikipedia]

Big Data is the new buzzword sweeping the worlds of IT and analytics. The number of new data sources, and the amount of data generated by existing and new sources, is growing at an incredible rate. Some recent statistics illustrate this explosion:

* Facebook generates 130 terabytes of data each day in user logs alone; an additional 200-400 terabytes are generated by users posting pictures.
* Google processes 25 petabytes (a petabyte is about 1,000 terabytes) each day.
* The Large Hadron Collider, the world's largest and highest-energy particle accelerator, built by the European Organization for Nuclear Research, generates one petabyte of data per second.

The total amount of data created worldwide in 2011 was about one zettabyte (1,000,000 petabytes), and this is projected to grow by 50%-60% each year into the future. With all of this data, the demand for skilled statistical problem solvers has never been greater.

Characteristics of Big Data

Big data can be described by the following characteristics:

Volume - The quantity of data that is generated. It is the size of the data that determines its value and potential, and whether it can actually be considered Big Data at all; the name "Big Data" itself refers to size.

Variety - The category to which the data belongs is also an essential fact that data analysts need to know. Knowing the variety of the data helps those who analyze it to use it effectively to their advantage.
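The growth projection above is simple compound growth. A minimal sketch, using only the figures quoted in the text (a one-zettabyte baseline in 2011 and 50%-60% annual growth):

```python
# Project worldwide data volume from the 2011 baseline of 1 zettabyte,
# compounding at the quoted 50% and 60% annual growth rates.

def projected_zettabytes(base_zb: float, annual_rate: float, years: int) -> float:
    """Compound growth: base * (1 + rate)^years."""
    return base_zb * (1 + annual_rate) ** years

base = 1.0  # zettabytes in 2011 (figure from the text)
for rate in (0.50, 0.60):
    for years in (1, 5, 10):
        zb = projected_zettabytes(base, rate, years)
        print(f"{int(rate * 100)}% growth, {years:2d} yr: {zb:8.1f} ZB")
```

At 50% annual growth, one zettabyte becomes roughly 58 zettabytes after ten years (1.5^10 ≈ 57.7), which conveys why tools that coped in 2011 stop coping soon after.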
Velocity - The speed at which data is generated and processed to meet the demands and challenges that lie ahead in the path of growth and development.

Variability - The inconsistency the data can show at times, which hampers the process of handling and managing it effectively. This is a factor that can be a problem for those who analyze the data.

Veracity - The quality of the data being captured can vary greatly. Accuracy of analysis depends on the veracity of the source data.

Complexity - Data management can become a very complex process, especially when large volumes of data come from multiple sources. These data need to be linked, connected, and correlated in order to grasp the information they are supposed to convey.

Technical definition of Big Data

Despite all of the above, we still need a good definition of Big Data. Two such definitions come to mind. The first is the Three Vs: Volume, Velocity, and Variety. Here, Big Data come in massive quantities, may come at you fast, and may also come in various forms (e.g. structured and unstructured). To this, we add a fourth V: Variability. Big Data
is highly variable, covering the full range of experiences in the human condition and the physical world.

The second definition may have more appeal to some: Big Data are data that are expensive to manage and hard to extract value from. This definition recognizes that the many forms and varieties of Big Data may be difficult to collect, manage, process, aggregate, or summarize.

The Berkeley AMP Lab suggests three pillars to deal with Big Data. New algorithms are needed, because many existing statistical and data mining algorithms will not scale, nor will some existing software handle Big Data (R, for example, is well known to have issues with memory). Furthermore, as data become bigger, using sampling for prediction or projection may miss important facts and phenomena.
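The sampling caveat can be made concrete with a quick simulation. This is a hypothetical sketch, not from the text: an event occurring once per million records is almost certain to be absent from a modest random sample, so any projection built from that sample misses the phenomenon entirely.

```python
import random

# Hedged illustration: a rare event occurs with probability 1-in-1,000,000
# per record. A 10,000-record random sample will almost always contain
# zero occurrences of it.

random.seed(42)
RARE_PROB = 1e-6      # one event per million records (assumed for illustration)
SAMPLE_SIZE = 10_000  # a typical "small" sample

sample_hits = sum(1 for _ in range(SAMPLE_SIZE) if random.random() < RARE_PROB)
print("rare events seen in sample:", sample_hits)

# Probability that the sample contains no rare event at all:
p_miss = (1 - RARE_PROB) ** SAMPLE_SIZE
print(f"chance the sample misses the phenomenon entirely: {p_miss:.1%}")  # ~99.0%
```

Analyzing the full data set, by contrast, would surface roughly one such event per million records, which is exactly the kind of fact sampling-based analysis can lose.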
Q: Why are we doing this?
Q: What is Information Insights doing with Big Data? Information Insights is meeting the challenge of Big Data by organizing our efforts around algorithms, machines, and people. Algorithms: We have taken a state-of-the-art statistical method, namely hierarchical Bayesian statistical models, and parallelized its implementation. This has resulted in blazing speed and the ability to do sophisticated analysis on terabytes of data. We continue to innovate by adding new statistical models that can be used in a variety of applications, including advanced econometrics, variable selection, hidden Markov models, Bayesian data fusion, tree-structured models, Bayesian CART, and Random Forests.
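The text does not describe how the parallelization works, but the simplest version of the idea can be sketched: run independent Monte Carlo chains for a Bayesian posterior in separate processes and pool their draws. This is a hypothetical illustration, not the company's actual implementation; the Beta-Binomial posterior below is chosen only because it has a known analytic answer.

```python
from multiprocessing import Pool
import random

# Hypothetical sketch: parallel Bayesian computation via independent
# Monte Carlo chains, one per worker process.

def run_chain(seed: int, n_draws: int = 50_000) -> float:
    """One 'chain': estimate the mean of a Beta(1+7, 1+3) posterior
    (a Beta(1,1) prior updated with 7 successes, 3 failures) by sampling."""
    rng = random.Random(seed)
    draws = [rng.betavariate(1 + 7, 1 + 3) for _ in range(n_draws)]
    return sum(draws) / n_draws

if __name__ == "__main__":
    with Pool(4) as pool:                       # 4 worker processes
        chain_means = pool.map(run_chain, [1, 2, 3, 4])
    estimate = sum(chain_means) / len(chain_means)
    print(f"pooled posterior-mean estimate: {estimate:.3f}")  # analytic answer: 8/12 ≈ 0.667
```

Real hierarchical models need more care (chains are not fully independent across levels), but the pattern of sharding the sampling work across cores is what makes terabyte-scale Bayesian analysis feasible.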
Machines: We have purchased our own in-house High Performance Computing Cluster (HPCC, a private cloud), which has been tuned for high-speed, complex mathematical and matrix calculations. We are able to get over 80% of theoretical performance from this HPCC, compared to the roughly 50% obtained when not so tuned. We currently have over 100 computation cores and half a terabyte of RAM; a planned expansion will add an additional 128 cores and 4 terabytes of RAM. People: We have invested considerable time in finding and training super-smart people. The company's atmosphere and ethos combine to foster collaboration, communication, and innovation.
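The tuning figures above translate directly into effective throughput. A small arithmetic sketch: the per-core peak below is an assumed, illustrative number (the text gives none), while the 80% and 50% efficiencies and the 100-core count come from the text.

```python
# Effective cluster throughput at the tuned vs. untuned efficiency levels.
# The 10 GFLOPS-per-core peak is an assumed figure for illustration only.

CORES = 100                  # "over 100 computation cores" (from the text)
PEAK_PER_CORE_GFLOPS = 10.0  # hypothetical per-core theoretical peak

def effective_gflops(cores: int, peak_per_core: float, efficiency: float) -> float:
    """Delivered throughput = cores * per-core peak * efficiency."""
    return cores * peak_per_core * efficiency

tuned = effective_gflops(CORES, PEAK_PER_CORE_GFLOPS, 0.80)
untuned = effective_gflops(CORES, PEAK_PER_CORE_GFLOPS, 0.50)
print(f"tuned:   {tuned:7.1f} GFLOPS")    # 800.0
print(f"untuned: {untuned:7.1f} GFLOPS")  # 500.0
print(f"speedup from tuning alone: {tuned / untuned:.2f}x")  # 1.60x
```

Whatever the true per-core peak, the ratio is what matters: moving from 50% to 80% efficiency is a 1.6x speedup without buying any new hardware.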