Data Analytics and The Steps
“Data is the new oil” and “Data is the contemporary fuel” are famous quotes that capture the growing importance of data by likening it to a raw material. Data, just like oil, is an elementary resource that requires further processing before it becomes useful.
Data analytics is the process of inspecting gigantic data sets, commonly known as “Big Data”, to divulge connections, correlations, trends, customer behavior, statistical patterns, and other meaningful inferences that help organizations make better business decisions. These insights predict opportunities for growth, prepare businesses to adapt to market dynamics, and position organizations to resist disruptive new entrants in their industries. Numerous companies such as Amazon, Google, Facebook and Capital One have built their complete business model around analytics. Once a person knows how and when to work with data, he can apply suitable analytical tools to extract influential and actionable insights and gain important advantages. Being equipped with business analytics skills leads to sharper business decisions, higher profit margins, efficient operations, and satisfied customers.
· Data Analysis vs Data Analytics
While talking about Data Analytics, we must answer one confusing question: Data Analysis vs Data Analytics. Data analysis refers to the process of examining the components of a given data set – separating them out and studying the parts individually and the relationships between them. Data analytics, on the other hand, is a broader term referring to a discipline that encompasses the complete management of data – including collecting, cleaning, organizing, storing, governing, and analyzing data – as well as the tools and techniques used to do so. The two terms also differ in their approach towards data: data analysis looks at the past, while data analytics tries to predict the future.
Data Analytics Cases & examples:
1. About 80% of the data available on the internet today is in a non-structured format. Most social media data is in the form of text, which can be used to analyze customer sentiment.
Eg. If a person uses negative words like repair, slow, trouble, waste, refund, or return after buying your product, the company can assume that he is not happy with the product.
2. HR Analytics: A company spends about 20% of a new employee’s CTC to upskill him and make him suitable for the company, so employee attrition is a loss for the company. Data scientists have developed algorithms powered by key indicators that, taken together, can reliably predict an employee’s intent to stay with the company or leave.
3. Customer lifetime value: Big shops and brands issue loyalty cards to track customers and their spending. This data can be used to promote relevant offers and attract customers during seasonal sales periods.
4. Health care analytics: The Framingham Heart Study identified risk factors for heart disease. Based on it, the 10-year cardiovascular risk of an individual can be estimated from characteristics and habits reported by the person.
Data analytics steps:
These six data analytics steps will help ensure that you realize business value from each unique project and mitigate the risk of error. The steps are explained in detail ahead. Python and R are two specialized coding languages that can be used to complete all of the following steps in a data analytics project.
Step 1: DEFINING THE GOAL
Before one can even think about the data, it is necessary to finalize what output is required from analytics or which processes are to be improved with the collected data. We must define a timeline and concrete key performance indicators. To have motivation, direction, and purpose, you have to identify a clear objective for the data: a concrete question to answer, a product to build, etc.
Eg. Sales forecasts, Decision making, Consumer behavior study
Step 2: COLLECTING THE DATA
This step starts with looking and searching for data. Mixing and merging data from as many sources as possible is what makes a data project great. Collected data is generally of the following types:
· Organisation data- Eg. Official organizational parameters like sales, profits, revenues, growth rates, trends, product ranges, etc.
· People data- Eg. Customer preferences, transaction history, social media behavior, demographics.
A few ways to get some usable data:
Connect to a database: We can work with IT teams to access the data that is available, or open up a private database and start digging through it to understand what information is collectible.
Use an API (Application Programming Interface): Think of the APIs of all the tools the company has been using and the data they have been collecting. One must work on getting these all set up so he can use those email open and click stats, the information the sales team puts in Pipedrive or Salesforce, the support tickets somebody submitted, etc.
Look for open data: The internet is full of datasets that can enrich what you have with additional information. A few examples of open data:
· Census data will help you add the average revenue for the district where the user lives
· Google Maps can show you how many coffee shops are on a given street where the company is planning to open an outlet
· CRISP-DM (Cross-Industry Standard Process for Data Mining): a process standard defined collaboratively by various researchers working in the field, which can guide how collected data is mined.
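As a minimal sketch of the "connect to a database" route above, the snippet below uses Python's built-in sqlite3 module; the sales table and its columns are hypothetical, invented for illustration:

```python
import sqlite3

# An in-memory database standing in for a real company database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("A", 1200.0), ("B", 800.0), ("A", 450.0)])

# Dig through the database to see what information is collectible.
rows = conn.execute(
    "SELECT product, SUM(revenue) FROM sales GROUP BY product ORDER BY product"
).fetchall()
print(rows)  # [('A', 1650.0), ('B', 800.0)]
conn.close()
```

In practice the connection string would point at the organization's own database instead of an in-memory one.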
Step 3: CLEANING THE DATA
While cleaning data, we must ask ourselves the following questions:
· Completeness – How do we fill in missing values, if any?
· Consistency – Are the data variables consistent, and how do we standardize them?
· Presence of outliers – Does the data contain extreme values that might affect the analysis?
· Presence of redundant variables – Are there redundant, repetitive variables that need to be removed?
Data preparation is a dreaded process that typically takes up to 80% of the time dedicated to a data project. We must start digging to see what we have and how it can be linked together to achieve the original goal. Start taking notes during your first analyses and ask questions of the people concerned to understand what all the variables mean.
Cleaning also includes looking at every column of the data to make sure the data is homogeneous and clean. Common tasks include record matching, identifying inaccurate data, assessing the overall quality of existing data, deduplication, and column segmentation. An important element of data preparation not to overlook is making sure that the data and the project comply with data privacy regulations. Personal data privacy and protection are becoming a priority for users, organizations, and legislators alike, and they should be a priority for you from the very start of your data journey. One should clearly tag datasets and projects that contain personal and/or sensitive data and therefore need to be treated differently.
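The cleaning questions above can be sketched with pandas; the column names and values here are invented for illustration:

```python
import pandas as pd

# A small, deliberately messy dataset (hypothetical columns).
df = pd.DataFrame({
    "age":  [25, None, 40, 40, 130],        # a missing value and an outlier
    "city": ["NY", "ny", "LA", "LA", "NY"]  # inconsistent casing
})

df["age"] = df["age"].fillna(df["age"].median())  # completeness
df["city"] = df["city"].str.upper()               # consistency
df = df[df["age"] < 100]                          # drop an extreme outlier
df = df.drop_duplicates()                         # remove redundant rows
print(df)
```

Each line corresponds to one of the questions: completeness, consistency, outliers, and redundancy.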
Step 4: ENRICHING THE DATA
After having clean data, it’s time to manipulate it to get the most value out of it. The data enrichment phase of the project can be started by joining all the different sources and grouping logs to narrow your data down to the essential features.
Eg. Data can be enriched by creating time-based features such as:
· Extracting date components (month, hour, day of the week, week of the year, etc.)
· Calculating differences between date columns
· Flagging national holidays
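These time-based features can be derived with pandas; the order dates and the holiday list below are invented placeholders, not a real calendar:

```python
import pandas as pd

orders = pd.DataFrame({
    "order_date": pd.to_datetime(["2023-01-01", "2023-03-15"]),
    "ship_date":  pd.to_datetime(["2023-01-04", "2023-03-16"]),
})

# Extract date components.
orders["month"] = orders["order_date"].dt.month
orders["day_of_week"] = orders["order_date"].dt.dayofweek
orders["week_of_year"] = orders["order_date"].dt.isocalendar().week

# Calculate the difference between two date columns.
orders["days_to_ship"] = (orders["ship_date"] - orders["order_date"]).dt.days

# Flag national holidays (placeholder list for illustration).
holidays = {pd.Timestamp("2023-01-01")}
orders["is_holiday"] = orders["order_date"].isin(holidays)
print(orders)
```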
Another way of enriching data is by joining datasets — essentially, retrieving columns from one dataset or tab into a reference dataset.
Eg. A bank has given data on 1 million customers whose loans were approved or rejected. Based on the parameters in that historical data, we can build a model of which future requests should be accepted or rejected.
· 70% of the data is typically used to train the machine for decision making based on given parameters.
· 30% of the data is used to test the same model and check whether it is taking correct decisions.
· Based on the results on the test data, the model can be refined to provide better results.
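The 70/30 split described above can be sketched with scikit-learn; the loan-application features and labels here are synthetic:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))           # synthetic loan-application features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic approve/reject label

# 70% of the data trains the model; 30% is held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
print(len(X_train), len(X_test))  # 700 300
```

A model fitted on `X_train` is then evaluated on `X_test` to check whether it makes correct decisions on unseen data.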
When collecting, preparing, and manipulating the data, one needs to be extra careful not to insert unintended bias or other undesirable patterns into it. Data used in building machine learning models and AI algorithms is often a representation of the outside world, and thus can be deeply biased against certain groups and individuals. The most worrisome thing about data and AI is that an algorithm cannot recognize bias on its own. As a result, when someone trains a model on biased data, it will interpret recurring bias as a pattern to reproduce, not something to correct. An important part of the data manipulation process is therefore making sure that the datasets used do not reproduce or reinforce any bias that could lead to biased, unjust, or unfair outputs.
Step 5: DATA VISUALIZATION
Data visualization is the process of putting data into a chart, graph, or other visual formats that help in analysis and interpretation.
Common data visual formats include frequency tables, cross-tabulation tables, bar charts, histograms, line graphs, pie charts, heat maps, and scatter graphs.
1. Frequency tables: A tabular summary of data showing the number (frequency) of data values in each of several non-overlapping classes.
Based on the nature of the data, the following frequency distribution can be used:
· Relative frequency distribution: A tabular summary of data showing the fraction or proportion of data values in each of several non-overlapping classes.
· Percent frequency distribution: A tabular summary of data showing the percentage of data values in each of several non-overlapping classes.
· Cumulative frequency distribution: A tabular summary of quantitative data showing the number of data values that are less than or equal to the upper-class limit of each class.
· Cumulative relative frequency distribution: A tabular summary of quantitative data showing the fraction or proportion of data values that are less than or equal to the upper-class limit of each class.
· Cumulative percent frequency distribution: A tabular summary of quantitative data showing the percentage of data values that are less than or equal to the upper-class limit of each class.
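These distributions map directly onto pandas operations; a minimal sketch with invented grade data:

```python
import pandas as pd

grades = pd.Series(["A", "B", "B", "C", "A", "B"])

freq = grades.value_counts().sort_index()  # frequency distribution
rel = freq / freq.sum()                    # relative frequency
pct = rel * 100                            # percent frequency
cum = freq.cumsum()                        # cumulative frequency
print(pd.DataFrame({"freq": freq, "rel": rel, "pct": pct, "cum": cum}))
```

Cumulative relative and cumulative percent distributions follow the same pattern, applying `cumsum()` to `rel` and `pct`.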
2. Cross-tabulation tables: A tabular summary of data for two variables. The classes for one variable are represented by the rows; the classes for the other variable are represented by the columns.
3. Bar charts: A graphical device for depicting qualitative data that have been summarized in a frequency, relative frequency, or percent frequency distribution.
4. Histogram: A graphical presentation of a frequency distribution, the relative frequency distribution, or percent frequency distribution of quantitative data constructed by placing the class intervals on the horizontal axis and the frequencies, relative frequencies, or percent frequencies on the vertical axis.
5. Line graphs: A line graph is a graph that uses lines to connect individual data points that display quantitative values over a specified time interval.
6. Pie charts: A graphical device for presenting data summaries based on the subdivision of a circle into sectors that correspond to the relative frequency for each class.
7. Heat Maps: A heatmap is a two-dimensional visual representation of data using colors, where different colors represent different values.
8. Scatter graphs: A graphical presentation of the relationship between two quantitative variables. One variable is shown on the horizontal axis and the other on the vertical axis. A trendline is a line that provides an approximation of the relationship between the two variables.
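As a sketch of two of these formats, the snippet below draws a bar chart and a scatter graph with matplotlib; the data and output file name are invented for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display required
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Bar chart: qualitative data summarized as a frequency distribution.
ax1.bar(["A", "B", "C"], [2, 3, 1])
ax1.set_title("Grade frequencies")

# Scatter graph: relationship between two quantitative variables.
ages = [4, 5, 7, 9, 12]
prices = [6300, 5700, 4500, 3100, 2200]
ax2.scatter(ages, prices)
ax2.set_xlabel("Car age (years)")
ax2.set_ylabel("Price (dollars)")

fig.tight_layout()
fig.savefig("charts.png")
```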
Step 6: MODELLING ALGORITHMS (Deploying ML for decision making)
The modeling algorithms of analytics can be divided into three types based on the result they produce.
· Descriptive analytics is a statistical method used to search and summarize historical data to identify patterns or meaning. It is one of the most basic pieces of business intelligence. It typically involves descriptive statistics such as arithmetic operations, means, medians, maxima, and percentages computed on existing data. Data aggregation and data mining are two techniques used in descriptive analytics to explore historical data: data is first gathered and sorted by aggregation to make the datasets more manageable for analysts. Eg. Most social media analytics is descriptive, as it mines loads of data and identifies trends using statistical tools.
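A minimal descriptive-analytics sketch with pandas, aggregating data and then summarizing it with basic statistics; the regional sales figures are invented:

```python
import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "revenue": [100, 150, 80, 120],
})

# Aggregate by region, then summarize with descriptive statistics.
by_region = sales.groupby("region")["revenue"].agg(["mean", "median", "max"])
print(by_region)
```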
· Predictive Analytics is a statistical method that utilizes algorithms and machine learning to identify trends in data and predict future behaviors. Predictive Analytics can take both past and current data and offer predictions of what could happen in the future. It relies heavily on complex models designed to make inferences about the data. These models utilize algorithms and machine learning to analyze past and present data to provide future trends. Some common basic models that are utilized at a broad level are as described ahead.
1. Decision trees:
Decision trees use branching to show the possibilities stemming from each outcome or choice. Decision tree algorithms are often referred to as CART (Classification and Regression Trees). Decision trees have a natural “if … then … else …” construction that makes them fit easily into a programmatic structure. Regression trees are used when the dependent variable is continuous; classification trees are used when the dependent variable is categorical.
The common terminologies used in decision trees are as follows:
· Root Node: The root node represents the entire population or sample, and this further gets divided into two or more homogeneous sets.
· Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
· Terminal Node: Nodes that do not split further are called terminal nodes (or leaves).
· Pruning: Removing the sub-nodes of a decision node by combining them is called pruning.
· Branch: A subsection of the entire tree is called a branch.
· Parent and Child Node: A node that is divided into sub-nodes is called the parent node of those sub-nodes; the sub-nodes are its children.
Eg. Let’s say we have a sample of 30 students with three variables: Gender (Boy/Girl), Class (IX/X), and Height (5 to 6 ft). 15 of these 30 play cricket in their leisure time. Now we want to create a model to predict who will play cricket during leisure time. In this problem, we need to segregate students who play cricket based on the most significant input variables among the three; here the decision tree splits on gender and height.
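A hedged sketch of such a classifier with scikit-learn; the student data below is synthetic, not the actual 30-student sample described above:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the student sample (gender: 0 = girl, 1 = boy).
students = pd.DataFrame({
    "gender":        [1, 1, 1, 0, 0, 0, 1, 0],
    "height_ft":     [5.8, 5.9, 5.2, 5.1, 5.0, 5.3, 5.7, 5.6],
    "plays_cricket": [1, 1, 0, 0, 0, 0, 1, 0],
})

X = students[["gender", "height_ft"]]
y = students["plays_cricket"]

# A shallow tree: at most two "if ... then ... else ..." levels.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

new_student = pd.DataFrame([[1, 5.85]], columns=["gender", "height_ft"])
print(tree.predict(new_student))  # predicts class 1 (plays) for a tall boy
```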
2. Regression techniques assist with understanding relationships between variables. One variable (X) is called the independent variable or predictor. The other variable (Y) is known as the dependent variable or outcome. The simple linear regression equation is:
Y = β0 + β1X, where
X – the value of the independent variable, Y – the value of the dependent variable,
β0 – a constant (the value of Y when X = 0),
β1 – the regression coefficient (how much Y changes for each unit change in X).
Eg. Suppose we have to examine the relationship between the age and price of used cars sold in the last year by a car dealership. Here is the data:
Car Age (in years) | Price (in dollars)
4 | 6300
4 | 5800
5 | 5700
5 | 4500
7 | 4500
7 | 4200
8 | 4100
9 | 3100
10 | 2100
11 | 2500
12 | 2200
Now we can see a negative relationship between car price (Y) and car age (X) – as car age increases, price decreases. Fitting the simple linear regression equation Y = β0 + β1X to this data gives Y = 7836 – 502.4X. We can use the data from the table to create a scatter plot with the fitted regression line.
Result interpretation:
With an estimated slope of –502.4, we can conclude that the average car price decreases by $502.40 for each year a car increases in age.
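The fitted coefficients can be reproduced from the table with NumPy's least-squares polynomial fit:

```python
import numpy as np

age = np.array([4, 4, 5, 5, 7, 7, 8, 9, 10, 11, 12])
price = np.array([6300, 5800, 5700, 4500, 4500, 4200,
                  4100, 3100, 2100, 2500, 2200])

# A degree-1 polyfit returns [slope, intercept] of the least-squares line.
slope, intercept = np.polyfit(age, price, 1)
print(round(slope, 1), round(intercept, 1))  # -502.4 7836.3
```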
3. Time series models
A time series model forecasts the values of a numerical data field for a future period. In contrast to regression methods, time series predictions are focused on future values of an ordered series. The time series algorithms are univariate algorithms meaning the independent variable is a time column or an order column. The forecasts are based on past values.
· Basic Structures
The following two structures are considered for basic decomposition models:
Additive: Y = Trend + Seasonal + Random
Multiplicative: Y = Trend × Seasonal × Random
The additive model is useful when the seasonal variation is relatively constant over time.
The multiplicative model is useful when the seasonal variation increases over time.
The “Random” term is often called “Irregular” in software for decompositions.
Eg. Consider a time series plot of beer production in Australia. The seasonal variation appears to be about the same magnitude across time, so an additive decomposition might be a good choice.
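An additive decomposition can be sketched by hand with NumPy. Here the quarterly series is synthetic, and the trend is estimated with a simple straight-line fit (rather than the moving average many decomposition tools use):

```python
import numpy as np

period = 4  # quarterly data
t = np.arange(24)
seasonal_true = np.array([5.0, -2.0, -4.0, 1.0])
y = 100 + 2 * t + np.tile(seasonal_true, 6)  # trend + seasonal pattern

# Trend: estimated here with a straight-line least-squares fit.
slope, intercept = np.polyfit(t, y, 1)
trend = intercept + slope * t

# Seasonal: average detrended value per quarter, centered to sum to zero.
detrended = y - trend
seasonal = np.array([detrended[i::period].mean() for i in range(period)])
seasonal -= seasonal.mean()

# Random (irregular): whatever remains after removing trend and seasonal.
irregular = detrended - np.tile(seasonal, 6)
print(np.round(seasonal, 2))  # approximately [5, -2, -4, 1]
```

The recovered seasonal component closely matches the pattern built into the series, and the irregular term is small, as expected for noise-free data.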
· Prescriptive analytics is a statistical method used to generate recommendations and make decisions based on the computational findings of algorithmic models. It is considered an extension of predictive analytics and requires complex algorithms to accomplish such machine-based decision-making. It can suggest all favorable outcomes for a specified course of action, and can also suggest various courses of action to reach a particular outcome. Hence, it uses a strong feedback system that constantly learns and updates the relationship between action and outcome.
Prescriptive analytics consists of two categories of algorithms:
1. Heuristics are a set of problem-dependent rules. Heuristics use highly specialized techniques designed to take advantage of a particular aspect of the problem. They typically require developing either a set of mathematical functions (e.g. f(x) = y), a set of instructions (e.g., “If this… then do this”), or both. Some examples of heuristics are the Genetic Algorithm (GA) and the Support Vector Machine (SVM).
2. Optimization is a combination of mathematical modeling and exact algorithms used to find the optimal answer. A problem is defined by writing math equations using a model-building platform. Once the model is created, it is sent to a highly specialized algorithm that solves the problem. Some of the important optimization models are:
· Stochastic optimization is the process of maximizing or minimizing the value of a mathematical or statistical function when one or more of the input parameters is subject to randomness. The word stochastic means involving chance or probability.
· The fuzzy optimization approach is known to be useful for multiobjective and multiconstraint decision situations in which the objectives and constraints are approximate. Fuzzy Logic (FL) is a method of reasoning that resembles human reasoning. The approach of FL imitates the way of decision making in humans that involves all intermediate possibilities between digital values YES and NO. The fuzzy logic algorithm helps to solve a problem after considering all available data. Then it takes the best possible decision for the given input.
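As a toy illustration of optimization under randomness, the sketch below uses simple random search, repeatedly sampling candidate inputs and keeping the best one, to minimize an invented cost function:

```python
import random

def cost(x):
    # A simple convex cost function with its minimum at x = 3.
    return (x - 3) ** 2 + 1

random.seed(42)
best_x, best_cost = None, float("inf")

# Random search: sample candidates and keep the best seen so far.
for _ in range(10_000):
    x = random.uniform(-10, 10)
    c = cost(x)
    if c < best_cost:
        best_x, best_cost = x, c

print(round(best_x, 2), round(best_cost, 3))
```

With enough samples, the best candidate lands very close to the true minimum at x = 3; real prescriptive systems use far more sophisticated heuristics and exact solvers than this.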