How to carry out a data science project? (Part 2)

 

Step 4: Model Data

We can separate this “model data” step into 4 sub-steps, illustrated by the short Python sketch that follows the list: 

  

 

      1. Feature engineering is probably the most important step in the model-creation process. The first thing to define is the term feature: a feature is an input variable, built from the raw data, that is fed to the learning model. Feature engineering therefore covers all the actions carried out on the raw data (cleaning it, removing null values, removing aberrant values) before the data are taken into account by the algorithm, and thus the model. In summary, feature engineering is the extraction of features from raw data that can be used to improve the performance of the machine learning algorithm.
      2. Model training is the action of feeding the algorithms with datasets so that they learn and improve. The ability of machine learning models to handle large volumes of data makes it possible to identify anomalies and test correlations while searching the whole data stream for patterns, producing a set of candidate models.
      3. Model evaluation consists of assessing the created model through the output it produces once the data have been processed by the algorithm. The aim is to assess and validate the results given by the model. The model can be seen as a black box: the inputs (the dataset) are given to the model's algorithm during model training, and the outputs are assessed during model evaluation. After assessing the results, you can go back to the previous step to optimise your model.
      4. Model selection is the selection of the best-performing and most suitable model from the set of candidate models. This selection depends on the accuracy of the results given by each model. 
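
To make these four sub-steps more concrete, here is a minimal Python sketch using scikit-learn. It is only an illustration under assumptions: the CSV file, the column names and the two candidate models are hypothetical, not part of the original article.

# Minimal sketch of the "model data" step: feature engineering, training,
# evaluation and selection. File and column names are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("customers.csv")               # hypothetical raw data

# 1. Feature engineering: clean the raw data and derive usable features
df = df.dropna()                                # remove null values
df = df[df["age"].between(0, 120)]              # remove aberrant values
X = df[["age", "income"]]                       # features fed to the model
y = df["churn"]                                 # target to predict

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 2. Model training: feed the algorithms with the training dataset
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100),
}
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    # 3. Model evaluation: assess the output produced on unseen data
    scores[name] = accuracy_score(y_test, model.predict(X_test))

# 4. Model selection: keep the best-performing candidate
best = max(scores, key=scores.get)
print(f"Selected model: {best} (accuracy={scores[best]:.2f})")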

Step 5: Interpret results  

The main point about interpreting results is to represent and communicate them in a simple way. Indeed, the raw outputs of the previous steps can be heavy and hard to understand.

To interpret your results well, go back to the first step of the data science life cycle, which we covered in our last article, and check whether your results relate to the original purpose of the project and whether they are of any interest in addressing the underlying problem. Another key point is to check whether your results make sense. If they do, and if you answer the initial problem pertinently, then you have likely come to a productive conclusion.  

Step 6: Deployment

The deployment phase is the final phase of the life cycle of a data science project. It consists of deploying the chosen model and applying new data to it. In other words, making its predictions available to users or to a service system is known as deployment.
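
As a minimal sketch of what this can look like when predictions must be served to a user or service system, the model selected in step 4 could be persisted and exposed through a small web service. This is only one possible setup under assumptions: the file name, the feature names and the use of Flask are hypothetical choices for the example.

# Hedged deployment sketch: load a persisted model and serve its predictions.
# The model file, feature names and framework (Flask) are assumptions.
import joblib
from flask import Flask, request, jsonify

# The selected model would have been saved once after training, e.g.:
# joblib.dump(model, "churn_model.joblib")

app = Flask(__name__)
model = joblib.load("churn_model.joblib")       # load the deployed model

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body such as {"age": 42, "income": 30000}
    payload = request.get_json()
    features = [[payload["age"], payload["income"]]]
    prediction = model.predict(features)[0]
    return jsonify({"prediction": int(prediction)})

if __name__ == "__main__":
    app.run(port=5000)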

Although the purpose of the model is to increase understanding of the data, the knowledge gained will need to be organised and presented in a way that the client can use and understand. Depending on the needs, the deployment phase may be as simple as producing a report or as complex as implementing a reproducible data science process. 

By following these steps in your data science project process, you make better decisions for your business or government agency because your choices are backed by data that has been robustly collected and analysed. With practice, your data analysis gets faster and more accurate – meaning you make better, more informed decisions to run your organization most effectively. 

 

© Valkuren

Valkuren team

Today we are going to let our team members present themselves to you!

 

Hey, I’m Brieuc, Sales & Business Developer. At Valkuren, I feel accomplished in my work by combining my several points of interest in everyday business: marketing, strategy, management and data.

 

 

 

“In Tech we work, in Human Link we trust.” I’m Valérie, the managing partner of Valkuren. Every day I’m happy to work, share knowledge, drive the team @ Valkuren and bring expertise to the customers in their Data Analytics strategy. As a founder I promote gender equality & diversity. I’m really proud of the Valkuren team and the work we do & deliver. 

 

 

Hello, I am Lienchi. I am a Data/BI analyst at Valkuren. I use powerful tools to guide businesses to change, improve their processes and optimize their data. It allows me to be creative and it is a really rewarding journey.

 

 

Hi! I’m Mathilde and I am an HR consultant intern at Valkuren. My motto: a happy worker is a productive worker! So, I am working on modifying/creating processes that comply with Belgian law, in order to improve employee wellbeing in the company. 

 

Hello everyone, I am Arthur and I am currently working as a Data Scientist at Valkuren.  I am always enthusiastic to embark on a new data-driven journey, looking for some nice insights for your business!  

 

 

Hey, I’m Uendi, Data Scientist at Valkuren. As a mathematics graduate, I’m happy to admit that science is my passion and talent. I often find myself exploring its applications in my life. At Valkuren, this passion of mine was rekindled and I was reassured about my path as a woman in science and technology. “Valkuren is the key to my personal and professional growth; the only way to do great work is to love what you do.”  

 

Hello, I’m Magnus, Data Scientist at Valkuren. I thoroughly enjoy problem solving and have a passion for data-driven decision making, keeping up to date with the latest tools and techniques. I have particular interests in methods such as conformal prediction and recommender systems. Thanks to a background in mathematics and computer science, I help clients turn their data into actionable insights.

 

 

How to carry out a data science project? (Part 1)


To be completed properly, a data science project must follow a certain methodology composed of 6 different steps. 

Step 1: Project understanding  

In this step we’re looking to fully grasp the scope of the project and typically determine the following:  

      • The problem  
      • The potential solution(s)  
      • The necessary tools & techniques 

For this purpose, several questions could be asked: 

      • What is the objective of the project? 

      • How will this project add value? 

      • What data do we possess? What is the format of these data? 

      • Is it a regression or a classification problem? 

Step 2: Data mining and processing 

On its own, this step is composed of 3 levels:  

Data Mining: 

The data mining process identifies and extracts the useful information defined in Step 1. First, you have to identify the data sources; second, you have to access the storage space; and third, you have to retrieve the relevant data.  
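
As a small illustration of these three actions, here is a hedged Python sketch that reads from two hypothetical sources (a flat file and a database) and keeps only the relevant records. All file, table and column names are assumptions made for the example.

# Hedged data mining sketch: identify sources, access the storage space and
# retrieve the relevant data. File, table and column names are hypothetical.
import sqlite3
import pandas as pd

# Source 1: a flat file
orders = pd.read_csv("orders.csv")

# Source 2: a database (a local SQLite file in this example)
conn = sqlite3.connect("warehouse.db")
customers = pd.read_sql_query(
    "SELECT customer_id, country, signup_date FROM customers", conn
)
conn.close()

# Keep only the data relevant to the question defined in Step 1
relevant = orders.merge(customers, on="customer_id", how="inner")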

Quality assessment:  

Having the data is not enough; you need to check it and judge its reliability. To this end, you have to determine which data are usable and whether there are any missing or corrupt values. You also have to check the consistency of the data. In other words, this step helps to verify the veracity of the data provided and to find any errors. You can check this with statistical tools, such as a Q-Q plot.  
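
For example, a quick quality assessment in Python could look like the hedged sketch below; the dataset, the column names and the checks chosen are assumptions made purely for illustration.

# Hedged quality assessment sketch: missing values, inconsistent values,
# duplicates and a Q-Q plot. Dataset and column names are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

df = pd.read_csv("orders.csv")

print(df.isna().sum())                        # missing values per column
print((df["amount"] < 0).sum())               # inconsistent values (negative amounts)
print(df["customer_id"].duplicated().sum())   # duplicated identifiers

# Q-Q plot: compare the distribution of a variable to a normal distribution
stats.probplot(df["amount"].dropna(), dist="norm", plot=plt)
plt.show()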

Data cleaning:  

Real-world data is often noisy and presents quality issues. The quality assessment step provides a clear overview of the discrepancies in the data, and the data cleaning process deals with them. The aim of this step is to correct quality flaws, transform the data and remove records that are faulty. 
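
A minimal cleaning sketch in Python, assuming the same hypothetical dataset and purely illustrative column names and rules, could look like this:

# Hedged data cleaning sketch: correct quality flaws, transform the data and
# remove faulty records, following the quality assessment's findings.
import pandas as pd

df = pd.read_csv("orders.csv")

df = df.drop_duplicates()                                 # remove duplicated records
df = df[df["amount"] >= 0]                                # remove faulty (negative) amounts
df["country"] = df["country"].str.strip().str.upper()     # harmonise a text field
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df = df.dropna(subset=["customer_id", "order_date"])      # drop unusable rows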

Step 3: Data exploration  

Data exploration is the first step of the data analysis. The goal is to synthesise the main characteristics of the data. The purpose of this step isn’t to draw important conclusions, but to become familiar with the data and see general trends. It is also important for detecting errors in the data. There are different areas in data exploration: correlation analysis, descriptive statistics, data visualisation and dimensionality reduction. In each area you can use different statistical tools, as you can see in the diagram below.  

Manual or automatic methods can be used for data exploration. Manual methods give analysts the opportunity to take a first look and become familiar with the dataset. Automatic methods, on the other hand, make it possible to reorganise and delete unusable data.  

Data visualisation tools are widely used to get a more global view of the dataset, for a better understanding and to spot errors more easily. To make this possible, the main programmatic tools used are the languages R and Python; indeed, their flexibility is highly appreciated by data analysts.  
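
As an illustration in Python, a first exploration covering these four areas might look like the hedged sketch below; the dataset and columns are assumptions, and the choice of PCA for dimensionality reduction is just one option among others.

# Hedged exploration sketch: descriptive statistics, correlation analysis,
# a simple visualisation and a dimensionality reduction. Data is hypothetical.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

df = pd.read_csv("orders.csv")
numeric = df.select_dtypes("number")

print(numeric.describe())          # descriptive statistics
print(numeric.corr())              # correlation analysis

numeric.hist(bins=30)              # data visualisation: distribution of each variable
plt.show()

# Dimensionality reduction: project the numeric variables onto 2 components
components = PCA(n_components=2).fit_transform(numeric.fillna(0))
plt.scatter(components[:, 0], components[:, 1], s=5)
plt.show()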

 

Catch up on the 3 last steps in our next article.

© Valkuren