Working with data is a complex thing and not done in some days. It is rather a matter of several sequential steps that lead to a final output. In this post, I present the data science process for project execution in data science.
What is the Data Science Process?
Data Science is often mainly consisting of data wrangling and feature engineering, before one can go to the exciting stuff. Since data science is often very exploratory, processes didn’t much evolve around it (yet). In the data science process, I group the process in three main steps that have several sub-steps in it. Let’s first start with the three main steps:
- Data Acquisition
- Feature Engineering and Selection
- Model Training and Extraction
Each of the different process main steps contains some sub-steps. I will describe them a bit in detail now.
Step 1: Data Acquistion
Data Engineering is the main ingredient in this step. After a business question was formulated, it is necessary to now look for the data. In an ideal setup, you would already have a data catalog in your enterprise. If not, you might need to ask several people until you have found the right place to dig deeper into it.
First of all, you need to acquire the data. They might be internal sources but you might also combine them with external sources. In this context, you might want to read about the different data sources you need. Once you are done with having a first look at the data, it is necessary to integrate the data.
Data integration is often perceived as a challenging task. You need to setup a new environment to store the data or you need to extend an existing schema. A common practise is to build a data science lab. A data science lab should be an easy platform for data engineers and data scientists to work with data. A best practise for that is to use a prepared environment in the cloud for it.
After integrating the data, there comes the heavy part of cleaning the data. In most cases, Data is very nasty and thus needs a lot of cleaning with it. This is also mainly carried out by data engineers alongside with data analysts in a company. Once you are done with the data acquisition part of it, you can move on with the feature engineering and selection step.
Typically, this first process step can be very painful and long-lasting. It depends on different factors of an enterprise, such as the data quality itself, the availability of a data catalog and corresponding metadata descriptions. If your maturity in all these items is very high, it can take some days to a week, but in average it is rather 2 to 4 weeks of work.
Step 2: Feature Engineering and Selection
In the next step we start with a very important step in the Data Science process: Feature Engineering. Features are very important for Machine Learning and have a huge impact on the quality of the predictions. With Feature Engineering, you have to understand the domain you are in and what to use with it. One need to understand what data to use and for what reason.
After the feature engineering itself, it is necessary to select the relevant features with the feature selection. A common mistake is the overfitting of a model, or also called “feature explosion”. It happens often that too many features are created and thus the predictions aren’t accurate anymore. Therefore, it is very important to select only those features that are relevant to the use-case and thus bring some significance.
Another important step is the development of the cross-validation structure. This is necessary to check how the model will perform in practice. The cross-validation will measure the performance of your model and give you insights on how to use it. Next after that is the Hyperparameter tuning. Hyperparameters are fine-tuned to improve the prediction of your model.
This process is now carried out mainly by Data Scientists, but still supported by data engineers. The next and final step in the data science process is Model Training and Extraction.
Step 3: Model Training and Extraction
The last step in the process is the model training and extraction. In this step, the algorithm(s) for the model prediction are selected and compared to each other. In order to ease up work here, it is necessary to put all your process into a pipeline. (Note: I will explain the concept of the pipeline in a later post). After the training is done, you can go into the predictions itself and bring the model into production.
The following illustration outlines the now presented process:
The Data Science process itself can easily be carried out in a Scrum or Kanban approach, depending on your favourite management style. For instance, you could have each of the 3 process steps as sprints. The first sprint “Data Acquisition” might last longer than the other sprints or you could even break the first one into several sprints. For Agile Data Science, I can recommend you reading this post.