Data acquisition, quality vs quantity and governance & management



Dr. Aarne Klemetti

Researching lecturer Aarne Klemetti has 30+ years of experience in RDI and education in computer science, industrial economics, and the media industry. He has been involved in AI-related activities since the days of expert systems and their practical applications in the 1980s. His current work focuses on the AI/ML and data-related practicalities of scalable edge and osmotic computing.

(1) Data acquisition: there are two options for developing customized AI solutions: either collecting one's own data and training a model from scratch, or using pretrained/foundation models and fine-tuning them for the required purpose. The questions to be answered are:

  

Where does the data come from? 

What do we need to do to prepare the data?

Do we understand our requirements in relation to our data?

How can we be sure of the origins and nature of the data, i.e. transparency?

How long do the data we use or collect remain valid?

Can sensitive data be kept hidden during the ML process, for example by applying federated learning?
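To make the federated learning question concrete, here is a minimal sketch of the general federated averaging idea in plain Python. It assumes a toy one-dimensional linear model and two hypothetical clients; the function names, learning rate, and sample values are illustrative assumptions, not part of any specific framework. The point is only that each client shares model weights, never its raw samples. A real deployment would also need secure aggregation and protection against information leaking through the weights, which this sketch does not attempt.

```python
from typing import List, Tuple

Sample = Tuple[float, float]   # (x, y) pair held privately by one client
Weights = Tuple[float, float]  # (w, b) of the toy model y = w * x + b


def local_update(weights: Weights, data: List[Sample],
                 lr: float = 0.1, epochs: int = 20) -> Weights:
    """Run gradient descent on one client's private data; only weights are returned."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)


def federated_average(client_weights: List[Weights],
                      client_sizes: List[int]) -> Weights:
    """Combine client weights, weighting each client by its number of samples."""
    total = sum(client_sizes)
    w = sum(cw[0] * n for cw, n in zip(client_weights, client_sizes)) / total
    b = sum(cw[1] * n for cw, n in zip(client_weights, client_sizes)) / total
    return (w, b)


def train_federated(clients: List[List[Sample]], rounds: int = 50) -> Weights:
    """The server sees only weight updates, never the clients' samples."""
    global_weights: Weights = (0.0, 0.0)
    for _ in range(rounds):
        updates = [local_update(global_weights, data) for data in clients]
        sizes = [len(data) for data in clients]
        global_weights = federated_average(updates, sizes)
    return global_weights


# Two hypothetical clients, each holding private samples of y = 2x + 1.
clients = [
    [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)],
    [(-1.0, -1.0), (0.5, 2.0)],
]
w, b = train_federated(clients)
print(f"learned w={w:.3f}, b={b:.3f}")
```

Weighting by sample count is one common choice; other aggregation rules are possible and may suit unevenly distributed data better.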



(2) Quality vs quantity: how much data is enough, and how do we validate the data in order to avoid discrepancies and, for example, outliers induced by skewness or kurtosis? The questions to be answered are:



When is there a danger of overfitting the models?

Do the tests and validation activities support the model's credibility?

What is the level of explainability of the models? 
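As one possible starting point for the data validation mentioned under point (2), the sketch below computes skewness and excess kurtosis and flags outliers with the common 1.5 x IQR rule, using only the Python standard library. The sample values, function names, and the choice of the IQR rule are assumptions made for illustration; other outlier criteria may fit a given data set better.

```python
import statistics
from typing import List


def skewness(values: List[float]) -> float:
    """Population skewness: mean of cubed standardized deviations."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def excess_kurtosis(values: List[float]) -> float:
    """Population kurtosis minus 3, so a normal distribution scores about 0."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 4 for v in values) / len(values) - 3.0


def iqr_outliers(values: List[float], k: float = 1.5) -> List[float]:
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical sensor readings with one suspicious value.
sample = [10, 11, 12, 12, 13, 13, 14, 15, 60]
print("skewness:", round(skewness(sample), 2))
print("excess kurtosis:", round(excess_kurtosis(sample), 2))
print("outliers:", iqr_outliers(sample))
```

A strongly positive skewness together with a high excess kurtosis suggests a long right tail, which is a cue to inspect the flagged values before training rather than to delete them automatically.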



(3) Data management and governance: while RDBMSs are still valid, the focus is now on knowledge graphs and vector databases. DataOps and ModelOps/MLOps provide the frameworks here. The key aspect is to retain control throughout the whole data collection and machine learning process. The questions to be answered are:

  

Have we organized feature engineering and other activities consistently? 

How do we establish and maintain good practices for our data flow?

How can we continue to trust our AI/ML models?
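Since point (3) mentions vector databases, the following sketch shows the core operation such a store performs: ranking stored embeddings by cosine similarity to a query vector. It is a brute-force illustration in plain Python; the document keys and three-dimensional vectors are invented for the example, and production vector databases typically rely on approximate nearest-neighbour indexes instead of scanning every entry.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: List[float], index: Dict[str, List[float]], k: int = 2) -> List[str]:
    """Return the keys of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in index.items()]
    scored.sort(reverse=True)
    return [key for _, key in scored[:k]]


# Hypothetical embeddings; real ones would come from an embedding model.
index = {
    "sensor_drift_report": [0.9, 0.1, 0.0],
    "model_card_v2": [0.1, 0.8, 0.3],
    "data_retention_policy": [0.0, 0.2, 0.9],
}
results = top_k([1.0, 0.2, 0.0], index, k=2)
print(results)
```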
Category: Management