Hang Dang

Jan 17, 2023 · 5 min read

Data-centric: the future of your AI development

Updated: Feb 1, 2023

Every AI solution consists of two components: code (the model) and data. Over the years, developers trying to increase the accuracy of machine learning models have tended to put more emphasis on improving the code than on curating appropriate datasets. That can only get us so far. The limits of this approach are now pushing us to look more carefully at the data, and that is the starting point for a data-centric approach to AI development.

Modern machine learning relies heavily on data, yet AI initiatives frequently overlook or mismanage it. As a result, hundreds of hours are wasted fine-tuning models that were built on bad data. In such cases, poor data, not insufficient tuning, is the primary reason a model's accuracy falls well below expectations. This is why we must focus on expanding and improving our datasets, which is what a data-centric strategy sets out to do.

What is data-centric?

A data-centric outlook is a core concept in data-centric architecture, in which data is seen as a crucial and permanent asset that is used to support applications and produce deliverables. In a data-centric architecture, the data model exists before any given application is implemented and endures long after the application has been abandoned.

In a data-centric approach, data guides projects, designs, business decisions, and culture. Thanks to cloud computing and storage, organizations can now remotely access and analyze large databases to make less biased, lower-risk, and more profitable decisions.

Data-centric vs. model-centric - the answer is clear

Currently, there are two main approaches to improving the accuracy of AI systems.

The model-centric approach focuses on improving the algorithm, code, and model architecture used to train the model.

The data-centric approach focuses on developing systematic engineering practices for improving the data in reliable and efficient ways.

Key difference

Here is the key difference.

In model-centric AI, the focus is on getting the code (model) right, while in data-centric AI, the focus shifts to the data, as shown in the picture below.

Which one is the better choice?

Andrew Ng, one of the most widely recognized figures in contemporary artificial intelligence, has a strong opinion on this matter. He argues that the focus of AI ecosystems needs to shift from model-centric to data-centric, and that this shift will substantially improve the effectiveness of deployed AI systems. His reasoning is that training is a relatively brief part of a machine learning model's lifecycle: roughly 80% of the lifecycle of an AI system is spent obtaining and preparing high-quality data, while only 20% is spent actually training the system. Despite this, both industry and academia have prioritized training.

More than 90% of AI research articles focus on algorithm improvements rather than on the value of high-quality data across the whole AI lifecycle. Ng argues that this is a serious disadvantage in real-world use cases, where high-quality data is essential and improves benchmark results.

It should be stressed, however, that a successful AI application depends on the combination of a well-designed model and high-quality data, not on good data or a good model alone. The data-centric view highlights how often we neglect the data while spending too much time on improvements to the model architecture. Only a small share of AI research (about 1%) is devoted to data. According to Andrew Ng, a data-centric AI strategy produces models that perform better, as the results below illustrate.

Approach	Steel defect detection	Solar panel	Surface inspection
Baseline	76.2%	75.68%	85.05%
Model-centric	+0% (76.2%)	+0.04% (75.72%)	+0% (85.05%)
Data-centric	+16.9% (93.1%)	+3.06% (78.74%)	+0.4% (85.45%)
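
The gains in the table are simple differences, in percentage points, between each approach and the baseline. A minimal sketch that recomputes them from the figures above; the dictionary layout, the short task keys, and the function name gains_over_baseline are illustrative choices, not part of the original results:

```python
# Figures (in %) copied from the table above; the short keys are illustrative.
BASELINE = {"steel": 76.2, "solar": 75.68, "surface": 85.05}
MODEL_CENTRIC = {"steel": 76.2, "solar": 75.72, "surface": 85.05}
DATA_CENTRIC = {"steel": 93.1, "solar": 78.74, "surface": 85.45}


def gains_over_baseline(results, baseline):
    """Return the improvement over the baseline in percentage points."""
    return {task: round(results[task] - baseline[task], 2) for task in baseline}


model_gains = gains_over_baseline(MODEL_CENTRIC, BASELINE)
data_gains = gains_over_baseline(DATA_CENTRIC, BASELINE)

for task in BASELINE:
    print(f"{task}: model-centric +{model_gains[task]}, data-centric +{data_gains[task]}")
```

Keeping the raw scores and deriving the gains, rather than storing the gains directly, makes it easy to check that the reported improvements match the underlying numbers.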

How data-centric benefits your business

There are many benefits of data-centric AI, but the most notable include improved accuracy, increased efficiency, and reduced costs. Machine learning models are only as good as the data they are trained on, so by focusing on data rather than code, businesses can keep their models accurate and up to date.

Improved Performance

The goal of a data-centric approach is to give the AI system high-quality, consistent information. The more accurate and reliable this input becomes over time, the better the system performs at tasks such as learning new concepts or predicting future outcomes.

Promotes Collaboration

The data-centric approach to quality management promotes collaboration between managers, experts, and developers. During development they can work together to reach consensus on defects and labels, then build models and analyze the results to decide on further optimizations if needed.
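
One common way to reach consensus on labels is a majority vote across annotators, with unclear cases sent back for discussion. The sketch below is only an illustration of that idea, not a description of any particular tool: the function consensus_label, the min_agreement threshold, and the sample annotations are all hypothetical.

```python
from collections import Counter


def consensus_label(votes, min_agreement=0.6):
    """Return the majority label, or None when annotators disagree too much.

    `min_agreement` is a hypothetical threshold: the share of votes the
    top label must reach before it is accepted without further review.
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= min_agreement else None


# Hypothetical annotations: image id -> labels from three annotators.
annotations = {
    "img_001": ["scratch", "scratch", "scratch"],
    "img_002": ["scratch", "dent", "scratch"],
    "img_003": ["scratch", "dent", "clean"],
}

resolved = {}       # images whose label was settled by the vote
needs_review = []   # images to hand back to the experts
for image_id, votes in annotations.items():
    label = consensus_label(votes)
    if label is None:
        needs_review.append(image_id)
    else:
        resolved[image_id] = label

print(resolved)
print(needs_review)
```

The images that end up in needs_review are where managers, experts, and developers would sit down together and agree on a labelling rule.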

Eliminate Wasted Time

The data-centric approach reduces development time by allowing teams to work in parallel, each directly influencing the AI system's accuracy. Eliminating unnecessary back-and-forth between groups frees up valuable resources for other tasks that need more attention.

Which businesses are taking a data-centric approach to AI?

Some of the world’s largest Internet and IT companies, such as Amazon, Google, and Microsoft, have already become data-centric businesses. At the start of 2019, data-centricity was listed among the top five business trends to watch.

Which one to prioritize: data quantity or data quality?

Before going any further, I want to emphasize that more data does not automatically mean better data. Of course, a neural network can’t be trained on just a few images, but the emphasis is now on quality rather than quantity.

Data quantity

Data quantity refers to the amount of data available. With a quantity-first mindset, the main goal is to gather as much data as possible and then train a neural network to learn the mappings.

Data quality

Data quality, as the name suggests, is about how good the data is rather than how much of it there is. It makes no difference if you don’t have millions of samples; what matters is that the ones you have are of high quality and properly labelled.

With low-quality data, flaws and inaccuracies can go undetected indefinitely. The accuracy of a model depends on the quality of its data; if you want to make good decisions, you need accurate information. Poor-quality data is likely to contain errors and anomalies, which can be very costly when used for predictive analytics and modeling.
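
Many of these flaws can be surfaced with a simple automated audit before any training starts. The following is a minimal sketch under stated assumptions: each record is a (features, label) pair with hashable features, and the function audit_dataset and the sample records are hypothetical, chosen only to illustrate three common checks (missing labels, exact duplicates, and identical inputs with conflicting labels).

```python
from collections import defaultdict


def audit_dataset(records):
    """Report missing labels, duplicate records, and conflicting labels.

    Each record is a (features, label) pair; features must be hashable.
    """
    # Positions of records that have no usable label.
    missing = [i for i, (_, label) in enumerate(records) if label in (None, "")]

    # Group the labelled records by their features.
    labels_by_features = defaultdict(list)
    for features, label in records:
        if label not in (None, ""):
            labels_by_features[features].append(label)

    # Same features repeated with the same label: redundant copies.
    duplicates = [
        f for f, labels in labels_by_features.items()
        if len(labels) > 1 and len(set(labels)) == 1
    ]
    # Same features with different labels: at least one label is wrong.
    conflicts = [f for f, labels in labels_by_features.items() if len(set(labels)) > 1]
    return {"missing": missing, "duplicates": duplicates, "conflicts": conflicts}


# Hypothetical records: (sensor readings, label).
records = [
    ((0.9, 0.1), "defect"),
    ((0.9, 0.1), "defect"),   # exact duplicate
    ((0.2, 0.8), "ok"),
    ((0.2, 0.8), "defect"),   # same input, different label
    ((0.5, 0.5), None),       # label missing
]

report = audit_dataset(records)
print(report)
```

Checks like these only catch the mechanical problems; whether a label is actually correct still needs a human reviewer.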

In general, more data leads to more reliable models and therefore better results, but only as long as the data is real and representative. A smaller amount of good data is preferable to a larger volume of poor-quality data. Sometimes, however, there is simply not enough quality data to train a model of the problem at hand, and therefore not enough to deliver a solution based on data analytics and artificial intelligence.

Another recurring problem is that, even when the dataset is already sufficient to take full advantage of AI systems, there is a tendency to keep collecting additional data because storage and processing power are cheap. The current trend of generating and storing large volumes of information shows no sign of slowing. That is why it is important for companies to establish a set of rules and procedures that define and regulate how data is treated, both to facilitate data governance and to ensure the success of advanced analytics and AI solutions.

Data-centric with Pixta AI

In Pixta AI’s full-package annotation service, we apply a data-centric approach along with human-in-the-loop models to optimize all of your automated image annotation projects. Even if you are a small business or startup with a tight budget, don’t worry: Pixta AI makes AI and the data-centric approach accessible to everyone. With PixtaStock, our 100M visual data library, we can provide you with any dataset you need, in large quantity, with high quality and full compliance.