The Data Mining Process
By Shreya Srivastava

Data mining involves going through large data sets to identify patterns and establish relationships, so that problems can be solved through data analysis and future trends can be predicted. Data mining is a craft: it applies a great deal of science and technology, but it also needs a well-understood process that can analyse huge amounts of data and look for patterns with consistency, repeatability and objectivity. A useful codification of the data mining process is the Cross Industry Standard Process for Data Mining (CRISP-DM), which consists of six steps: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation and Deployment.

Going through the entire process once without solving the problem is not considered a failure, because the first cycle is largely data exploration: after it, the data scientists know much more about the data than they did when they first got hold of it. The next few iterations build on that exploration, refine the data much further and give us a well-explained analysis. Let’s explore each of the steps in detail.

Business Understanding: The very first thing we need to do is understand the business problem, i.e. what we are trying to accomplish. This seems a simple step, but many businesses give very ambiguous requirements. Often we have to recast the problem and design the solution through an iterative process. At this stage the analyst’s creativity plays a major role in casting the business problem. A data scientist with strong foundational knowledge can look at the problem creatively and help build a clear understanding of it.
For example, if we get a large data set, we need to figure out what exactly we want to do with this data, how exactly we will do it, which data mining model should be used, and so on. In doing so we get a simplified view of the use scenario, and by doing this iteratively we refine the use scenario further and further until it closely matches the business problem.

Data Understanding: The huge amount of data that is collected is just a pile of raw material, which we will refine, clean, filter and analyse to get a solution. It is therefore important to find the strengths and limitations of the data, because in most cases there are many discrepancies between the data and the problem. The cost of data varies as well: some data is available for free, some requires effort to obtain, some has to be purchased, and some does not exist at all and requires a separate method or project to collect. Even after all the data has been obtained, collating it also takes effort. For example, cleaning and matching customer records to ensure there is only one record per customer is a complicated analytical problem that needs to be solved in this part of the CRISP-DM cycle.

Data Preparation: The analytical technologies we use often require the data in a different form from the one we have, so some conversion is necessary. Data preparation therefore proceeds alongside the data understanding phase, manipulating and converting the data so that it yields better results. Examples include converting the data into tabular form, removing or inferring missing values, removing duplicate values, and converting data from one type to another.

Modeling: Once the business problem has been understood and the data has been understood, cleaned and prepared, we start applying data mining techniques to the prepared data. We can also apply multiple modeling techniques.
In that case we must run each modeling technique separately. Before we actually build the model, we design a test to check its quality and validity. For example, in supervised data mining tasks such as classification, it is very common to use error rates as quality measures for the models. We therefore typically separate the data set into a train set and a test set, build the model on the train set, and estimate its quality on the separate test set. After that we run the modeling technique on the prepared data set to create one or more models.

Evaluation: The main purpose of the evaluation stage is to test the data mining results and gain confidence that they are valid and reliable before moving on to deployment. By looking carefully through the data we may find patterns, and we must check that the patterns extracted from the data are true regularities and not just anomalies. We should never deploy the results immediately after the data mining process, as it is usually easier, cheaper, quicker and safer to test a model first for accuracy. This stage also verifies whether the model satisfies the original business goals. Both quantitative and qualitative assessment are required to evaluate the result of the data mining process. This phase thus gives us an idea of how correct the model is.

Deployment: In this phase the results of data mining are put into actual use in order to solve the problem and realise a return on investment. The simplest and clearest form of deployment is implementing a predictive model in some business process. Nowadays, the data mining techniques themselves are increasingly being deployed. For example, online ad-targeting systems are deployed to show you advertisements for products related to the ones you browsed earlier.
Also, deploying a model into a production system requires that the model be re-coded for the production environment, generally for greater speed or for compatibility with an existing system. This may increase expense and investment.

Regardless of whether the deployment was successful, the process often returns to the business understanding and data understanding phases, and a second iteration may yield an improved solution.

REFERENCES:
1. Data Science for Business by Foster Provost and Tom Fawcett
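
To make the Data Preparation, Modeling and Evaluation steps described above more concrete, here is a minimal Python sketch under stated assumptions. The customer records, the field names (customer_id, spend, responded) and the single-threshold "model" are all made up for illustration and do not come from the book; a real project would most likely use a proper modeling library. The sketch removes a duplicate customer record, infers a missing value from the mean, separates the data into train and test sets, builds the model on the train set and estimates its error rate on the test set.

```python
import random

# Hypothetical raw data: 20 customers; "responded" is the label to predict.
raw = [{"customer_id": i, "spend": 10.0 * i, "responded": i >= 10}
       for i in range(1, 21)]
raw[11]["spend"] = None   # customer 12 has a missing value
raw.append(dict(raw[2]))  # customer 3 appears twice


def prepare(records):
    """Data preparation: one record per customer, missing values inferred."""
    seen, unique = set(), []
    for rec in records:
        if rec["customer_id"] in seen:
            continue  # keep only the first record for each customer
        seen.add(rec["customer_id"])
        unique.append(dict(rec))  # copy so the raw data is left untouched
    known = [r["spend"] for r in unique if r["spend"] is not None]
    mean_spend = sum(known) / len(known)
    for r in unique:
        if r["spend"] is None:
            r["spend"] = mean_spend  # infer the missing value from the mean
    return unique


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle reproducibly, then hold out a fraction of rows for testing."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


def error_rate(threshold, rows):
    """Fraction of rows where 'spend >= threshold' disagrees with the label."""
    wrong = sum((r["spend"] >= threshold) != r["responded"] for r in rows)
    return wrong / len(rows)


def fit_threshold(train):
    """Toy model: the spend threshold with the lowest error on the train set."""
    candidates = sorted(r["spend"] for r in train)
    return min(candidates, key=lambda t: error_rate(t, train))


clean = prepare(raw)
train_rows, test_rows = train_test_split(clean)
threshold = fit_threshold(train_rows)      # build the model on the train set
train_error = error_rate(threshold, train_rows)
test_error = error_rate(threshold, test_rows)  # estimate quality on unseen rows
print("threshold:", threshold)
print("train error:", train_error, "test error:", test_error)
```

The test error, not the train error, is the number to carry into the Evaluation step, because it is measured on records the model never saw while it was being built.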