The power of the data-centric approach

Good data scientists focus on models, great data scientists focus on data

Marcos Esteve
5 min read · Dec 13, 2022

After more than four years in the data science world, my view of how to approach problems has changed. In this post, I will explain how my thinking has evolved from a model-centric to a data-centric view, and what the benefits of this change are.

One of those click-moment stories: some years ago, I was working on a project whose objective was to predict certain values for different chemicals (a typical time series problem). The problem was that the data was inconsistent, with plenty of null values (between 60% and 80%). Anyone who has worked with time series probably knows all the problems that such a large number of nulls can cause. After many conversations with the client to explain these problems, and in an attempt to still deliver a model, I tried very complex models like Prophet, XGBoost, LSTMs and many others, but as expected, the performance metrics were far too low.

Another story, also from some time ago: I was working on a text classification problem. At that time the data was very scarce, so we decided to start gathering more texts and send them to annotators. After collecting thousands of sentences, we trained different models, including very complex ones like transformers, but the performance metrics were again very poor. After days of research, we figured out the problem: many sentences had been wrongly annotated by the annotators.

In the end, your model is only as good as your data: if the input to the model is garbage, then no matter how complex the model or how well tuned the hyperparameters, your metrics are going to be garbage. This statement is very obvious, but sometimes, as data scientists, we assume the data is perfect and that complex deep neural networks with the best hyperparameters will deliver great performance metrics.

So, to solve the problems mentioned above and many others, the one thing you need to do is debug your data, and for that the data-centric approach can help a lot.

But first, let’s define in technical terms the differences between the model-centric view and the data-centric view.

Model-centric view 🧠


In the model-centric view, the objective is to iterate on the ML model while considering the dataset as fixed. That is, we treat the dataset as ground truth and build better models by changing the architecture, adding new layers to our neural network, or doing better hyperparameter tuning. This is the typical Kaggle problem, where we need to get the best performance score with the data we have available.
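As a small illustration of this workflow, here is a minimal sketch (using a toy scikit-learn dataset as a stand-in for real data): the dataset stays fixed, and all the iteration happens on the model and its hyperparameters.

```python
# Model-centric iteration: the dataset (X, y) is treated as fixed truth,
# and we only search over models and hyperparameters.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy stand-in for a fixed, "trusted" dataset
X, y = make_classification(n_samples=1000, random_state=42)

# Iterate on the model only: try different hyperparameter combinations
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    cv=5,
    scoring="f1",
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```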

Data-centric view 💿


In the data-centric view, we iterate on the dataset while considering the model as fixed. Our objective is to audit the data, find bias in the dataset, or annotate more data, and for all of this we can also take advantage of ML techniques. In the following sections, I will document some of the advice I would have liked to receive when I was starting out in this great world…

1 Do a good Exploratory Data Analysis (EDA) 👨🏻‍💻

Here your data is your enemy and you need to get to the truth, so your objective should be to ask a lot of questions in order to uncover errors in your data. Some of the questions should be oriented towards finding bias. For example, in a multi-class text classification task, some questions you can ask are:

  • Are texts from one class shorter / longer than another class?
  • Are texts from one class containing more stopwords/verbs/adjectives/etc than another class?
  • Do we have a balanced distribution of samples per class in our dataset?
  • Are the texts from production similar to the ones we are going to use in our training dataset?

Depending on your task you will have different questions and answers, and with this information you may uncover problems in your dataset.
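A couple of these questions can be answered with a few lines of pandas. The DataFrame below, with its text and label columns, is just a toy placeholder for your own data:

```python
import pandas as pd

# Hypothetical dataset: replace with your own texts and labels
df = pd.DataFrame({
    "text": ["great product", "terrible, would not buy again", "it works"],
    "label": ["positive", "negative", "positive"],
})

# Are texts from one class shorter/longer than another class?
df["n_words"] = df["text"].str.split().str.len()
print(df.groupby("label")["n_words"].describe())

# Do we have a balanced distribution of samples per class?
print(df["label"].value_counts(normalize=True))
```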

2 Clean your dataset 🧹

As a second step, cleaning your dataset is a good bet. For that, you can use more advanced techniques, such as training an ML model on your dataset and using the loss of each sample to see which ones are noisiest for your model.
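Here is a minimal sketch of that idea, assuming a text classification setup with toy placeholder texts and labels: compute each sample's out-of-fold loss with cross-validation and review the highest-loss samples first, since those are where the model most strongly disagrees with the annotation.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

# Hypothetical placeholders: replace with your own dataset
texts = ["good", "bad", "great", "awful", "nice", "poor"]
labels = np.array([1, 0, 1, 0, 1, 0])

model = make_pipeline(TfidfVectorizer(), LogisticRegression())

# Out-of-fold probabilities, so no sample is scored by a model that saw it
probs = cross_val_predict(model, texts, labels, cv=3, method="predict_proba")

# Per-sample cross-entropy: a high loss means the model disagrees with the label
per_sample_loss = -np.log(probs[np.arange(len(labels)), labels] + 1e-12)
suspects = np.argsort(per_sample_loss)[::-1]  # noisiest candidates first
print([texts[i] for i in suspects[:3]])
```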

This technique is used across the industry; for example, people like Andrej Karpathy (Tesla/OpenAI) have endorsed it.

For example, in text classification tasks, the Argilla team found more than 50 mislabelled examples in the AG News dataset. Check here for a tutorial.

3 Acquire more data

Acquiring more data is relevant in two scenarios:

  1. You don’t have enough samples from a specific class
  2. Your model is in production and you need to monitor its predictions continuously to ensure its quality metrics do not degrade over time

For that, my biggest recommendation is to take advantage of Active Learning to select the samples that are most interesting to annotate. This reduces the number of annotator hours you need and also helps you improve your metrics faster than approaches not based on active learning.
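The simplest active learning strategy is uncertainty (least-confidence) sampling: ask the annotators to label the samples the current model is least sure about. A small sketch, with hypothetical labelled and unlabelled sets as placeholders:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical placeholders: a small labelled seed set and an unlabelled pool
labelled_texts = ["love it", "hate it", "amazing quality", "broke instantly"]
labelled_y = [1, 0, 1, 0]
unlabelled_texts = ["not sure how I feel", "best purchase ever", "meh"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(labelled_texts, labelled_y)

# Least-confidence score: 1 minus the probability of the top predicted class
probs = model.predict_proba(unlabelled_texts)
uncertainty = 1.0 - probs.max(axis=1)
to_annotate = np.argsort(uncertainty)[::-1]  # most uncertain first
print([unlabelled_texts[i] for i in to_annotate])
```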

Also, when you are acquiring more data, it is important to measure the quality of the annotations, for example by measuring Inter-Annotator Agreement. For that, you will need multiple annotators annotating the same samples. These techniques are very important for reducing the biases each annotator introduces.
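With two annotators, a common agreement metric is Cohen's kappa, available in scikit-learn (the labels below are toy placeholders). Values near 1 mean strong agreement; values near 0 mean the agreement is no better than chance.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical annotations of the same six samples by two annotators
annotator_a = ["pos", "neg", "pos", "pos", "neg", "pos"]
annotator_b = ["pos", "neg", "neg", "pos", "neg", "pos"]

print(cohen_kappa_score(annotator_a, annotator_b))
```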

Wrap up

Data is key, and your model can only be as good as your data. In this post, I have tried to explain the differences between a data-centric approach and a model-centric approach, and I have presented some techniques to improve the quality of your datasets. Feel free to comment on other techniques or tools that can help.

As a personal opinion, I strongly believe that teams need to focus on improving their datasets and on establishing processes to acquire new data with the expected quality. Once the processes for acquiring, cleaning and auditing data are in place, teams should focus on improving the ML architectures.

About the author ✍🏻

Marcos Esteve is an ML Team Lead at Ravenpack, where he leads and develops machine and deep learning models for a huge variety of Natural Language Processing tasks. He is quite interested in multimodality, search tasks, Graph Neural Networks and building data science apps. Contact him on LinkedIn or Twitter, or visit his personal webpage: https://marescas.github.io
