Our team has deep knowledge and extensive experience developing DB/ML/AI infrastructure for large projects in the Energy and Finance industries, and knows the pitfalls of custom databases that adhere to Tidy principles [14]. Tidy databases have many advantages: they provide concise, clear data storage that is easy for humans to maintain and comprehend. At the same time, they create many issues when used by ML and AI systems. To demonstrate these limitations, consider the issues common to most structured ARDN [15] datasets available today:
Issue 1 – Incomplete and changing attribute sets.
Issue 2 – Limited reporting and modeling opportunities.
Issue 3 – Domain-specific data systems are hard to expand to cover other domains.
Other known issues with adapting Tidy Data datasets for ML/AI use exist as well. Multiple vendors provide customized solutions to address some of these limitations, and Optimax AI offers solutions that mitigate most, if not all, of them. The core of the Optimax approach is a highly scalable, generic Entity–Attribute–Value (EAV) model.
Different teams within an organization often cannot reuse data stored in varying formats across projects. Machine Learning (ML) systems built on traditional data infrastructure are often tightly coupled to databases, object stores, streams, and files. This coupling means that any change in the data infrastructure can break dependent ML systems. Data Science and Data Engineering teams are turning toward attribute stores to manage the datasets and pipelines needed to productionize data usage.
Optimax AIDB decouples models from the data infrastructure by providing a single data access layer that abstracts attribute storage for reporting and model dataset generation.
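As a rough sketch (using hypothetical names, not the actual Optimax API), such a data access layer can be expressed as a single interface that downstream model code depends on:

```python
# Illustrative sketch only: a single data-access layer that hides the storage
# backend (SQL, object store, files) from downstream ML code. The class and
# method names are hypothetical, not the actual Optimax AIDB API.
from typing import Protocol

import pandas as pd


class AttributeStore(Protocol):
    def get_attributes(self, case_ids: list[int],
                       attributes: list[str]) -> pd.DataFrame:
        """Return a case-by-attribute frame, regardless of where data lives."""
        ...


def build_training_frame(store: AttributeStore, case_ids: list[int],
                         attributes: list[str]) -> pd.DataFrame:
    # Model code depends only on the AttributeStore interface, so swapping
    # PostgreSQL for BigQuery (or files) cannot break this function.
    return store.get_attributes(case_ids, attributes)
```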
The Entity–Attribute–Value (EAV) model [18] is a well-known concept primarily applied to sparse datasets and high-volume time-series data. Its main advantage is that it is an excellent fit for Machine Learning-centric platforms. It also supports sparse data efficiently, such as the data that often results from combining multiple field studies into a unified set of tables.
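A minimal illustration of the idea, assuming a toy set of study measurements:

```python
# A minimal illustration (not Optimax code) of why EAV suits sparse study data:
# each observation is one (entity, attribute, value) row, so entities measured
# on disjoint attribute sets coexist without NULL-heavy wide tables.
import pandas as pd

eav = pd.DataFrame(
    [(1, "porosity", 0.12), (1, "depth_m", 2450.0),
     (2, "depth_m", 1980.0), (2, "permeability_md", 35.0)],
    columns=["entity_id", "attribute", "value"],
)

# Pivoting to the wide ("tidy") view makes the sparsity explicit:
wide = eav.pivot(index="entity_id", columns="attribute", values="value")
print(wide)  # missing combinations appear as NaN instead of stored zeros
```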
The Optimax team knows how to get the best out of EAV and has enhanced it in AIDB to support virtually unlimited sets of the vast data generated by field studies. AIDB is fully compatible with the DS/ML infrastructure supported by Tecton (www.tecton.ai) and its legacy open-source platform Feast, so developers familiar with these popular tools can use most of the functionality supported by Tecton/Feast. Furthermore, Optimax AIDB has been extended and optimized for heterogeneous data by adding three major groups of tables: 1) Case tables, 2) Attribute tables, and 3) Data tables.
Entity (Case) tables correspond to different groups of objects, with all studies combined into one group. Table descriptions hold most of the data not directly related to primary ML/AI usage (such as detailed study descriptions, authors, etc.). The Case tables are designed to support multiple types of case hierarchies. They use different ID ranges to simplify partitioning, speed up data extraction and report generation, and reduce overall computational cost.
Attribute tables are the most crucial part of AIDB: they define the various relationships, groups, and hierarchies of attributes. As a result, the reporting and ML dataset generation engines can summarize and aggregate data over different dimensions.
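A hedged sketch of how these three table groups could be expressed as Django models (field and model names are illustrative assumptions, not the actual AIDB schema):

```python
# Hypothetical sketch of the three AIDB table groups as Django models.
# Field and model names are illustrative assumptions, not the actual schema.
from django.db import models


class Case(models.Model):                      # group 1: Case (entity) tables
    parent = models.ForeignKey("self", null=True, blank=True,
                               on_delete=models.CASCADE)  # case hierarchies
    description = models.TextField(blank=True)  # study metadata not used by ML


class Attribute(models.Model):                 # group 2: Attribute tables
    name = models.CharField(max_length=255, unique=True)
    unit = models.CharField(max_length=64, blank=True)
    group = models.ForeignKey("self", null=True, blank=True,
                              on_delete=models.SET_NULL)  # attribute hierarchies


class DataPoint(models.Model):                 # group 3: Data (value) tables
    case = models.ForeignKey(Case, on_delete=models.CASCADE)
    attribute = models.ForeignKey(Attribute, on_delete=models.CASCADE)
    value = models.FloatField()
    observed_at = models.DateTimeField(null=True, blank=True)  # time series
```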
Optimax offers an elegant solution to one of the most critical challenges faced by data systems that store data in generalized tables: the table structure is so generic that it is hard to provide record-by-record access control. To address this issue, User tables were added to the Optimax AIDB core. These tables support the enhanced security requirements of ODF. The User model of Extended Optimax AIDB is identical to the User model offered by the Django Project [19].
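As an illustration of record-by-record access control on generic tables (model and field names are assumptions), the owner and a sharing level can be stored on each case row and applied as a row-level query filter:

```python
# Illustrative only: record-level access control for generic EAV tables by
# keeping the owner and a sharing level on each case row (names hypothetical).
from django.conf import settings
from django.db import models
from django.db.models import Q


class SharedCase(models.Model):
    PRIVATE, SHARED, PUBLIC = "private", "shared", "public"
    owner = models.ForeignKey(settings.AUTH_USER_MODEL,
                              on_delete=models.CASCADE)
    sharing = models.CharField(max_length=16, default=PRIVATE)


def visible_cases(user):
    # Row-level filter: a user sees public rows, rows shared with all users,
    # and rows they own. Premium and group rules would add further Q clauses.
    return SharedCase.objects.filter(
        Q(sharing=SharedCase.PUBLIC)
        | Q(sharing=SharedCase.SHARED)
        | Q(owner=user)
    )
```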
Optimax AIDB is an essential part of the proposed solution. However, what makes it so powerful is the set of engines driving the data flow in and out of the DB. The chain of engines starts with raw data provided in CSV/TSV/XLS form and finishes with an ML-ready training/validation/testing dataset that can be fed "as is" into a powerful ML/AI modeling infrastructure.
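A simplified sketch of that chain, assuming a hypothetical input file and pandas/scikit-learn tooling:

```python
# Sketch of the engine chain described above (illustrative, simplified):
# raw CSV -> long EAV form -> wide matrix -> train/validation/test split.
import pandas as pd
from sklearn.model_selection import train_test_split

raw = pd.read_csv("study_results.csv")  # hypothetical input file and columns
eav = raw.melt(id_vars="case_id", var_name="attribute", value_name="value")
wide = eav.pivot(index="case_id", columns="attribute", values="value")

train, rest = train_test_split(wide, test_size=0.3, random_state=0)
valid, test = train_test_split(rest, test_size=0.5, random_state=0)
# `train`, `valid`, and `test` can now be fed "as is" to an ML framework.
```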
The infrastructure is expandable to cover different domains of human knowledge: interfaces have been designed to create and audit new groups of cases, covering domain provisioning.
The AIDB Data Formatting Engine is responsible for generating (a) human-readable reports and (b) files to be ingested by other models.
Both the reports and the model-specific files are not limited to any particular field of knowledge and can be helpful for interdisciplinary research. The most crucial feature of the Data Formatting Engine is its ability to deal with highly sparse data: a user-provided availability threshold (0.1–100%) filters out infrequent attributes, producing detailed reports as well as highly efficient, PyTorch-compatible sparse data views that are forwarded to the model training engines.
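A minimal sketch of the threshold filter, assuming the wide case-by-attribute frame from the earlier examples (function and parameter names are assumptions):

```python
# Illustrative sketch of the availability-threshold filter: keep only
# attributes observed for at least `threshold` percent of cases, then expose
# the result as a PyTorch sparse COO tensor.
import numpy as np
import pandas as pd
import torch


def sparse_view(wide: pd.DataFrame, threshold: float = 10.0) -> torch.Tensor:
    # Per-attribute availability as a percentage of non-missing cases.
    keep = wide.columns[wide.notna().mean() * 100.0 >= threshold]
    filtered = wide[keep]

    # Store only the observed (row, col, value) triples, as in EAV itself.
    rows, cols = np.nonzero(filtered.notna().to_numpy())
    values = filtered.to_numpy()[rows, cols].astype(np.float32)

    indices = torch.from_numpy(np.stack([rows, cols]))
    return torch.sparse_coo_tensor(indices, torch.from_numpy(values),
                                   size=filtered.shape)
```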
The Data Formatting Engine's design makes it friendly for researchers, data scientists, ML engineers, and inexperienced users alike. Users can create reports by selecting a few keywords or typing an arbitrary phrase; a Named-Entity Recognition (NER) [24] tool combined with a fuzzy matching layer then offers a list of attributes for selection. Users decide which attributes to use (a) either as-is or for grouping (using one or more hierarchy types and grouping levels), (b) in either rows or columns, (c) with an optional attribute availability threshold, and (d) with optional unit conversions. Although straightforward and generic, a report generation system built on these principles lets users create a wide range of custom reports by aggregating data on virtually any attribute type, from very narrow research fields to large-scale, cross-disciplinary studies. Advanced users will be able to generate more complicated reports as the system expands to cover other domains.
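As a stand-in for the fuzzy matching layer (stdlib difflib here; a production system would pair it with the NER step, which this sketch omits), attribute suggestion could look like this:

```python
# A minimal stand-in (stdlib only) for the fuzzy-matching layer described
# above: given a user phrase, suggest candidate attribute names for selection.
import difflib

ATTRIBUTES = ["porosity", "permeability_md", "depth_m", "water_saturation"]


def suggest_attributes(phrase: str, limit: int = 5) -> list[str]:
    suggestions = []
    for token in phrase.lower().split():
        suggestions += difflib.get_close_matches(token, ATTRIBUTES,
                                                 n=limit, cutoff=0.6)
    return list(dict.fromkeys(suggestions))  # de-duplicate, keep order


print(suggest_attributes("porosity and depth"))  # ['porosity', 'depth_m']
```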
The main goal of keeping data in AIDB format is to expose all available data, both complete and incomplete, to a wide range of DS/ML/AI tools such as PyTorch, scikit-learn, and numerous others. The data stored in expanded AIDB tables is fully compatible with most ML packages and requires no additional processing. Models can be fed, trained, optimized for hyper-parameters, and used directly in near real time. This opens a window to a vast diversity of modeling, research, and analytical tools and builds a gateway to an AI-guided future.
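A small illustration, with synthetic data, of feeding sparse input directly into scikit-learn, including hyper-parameter search (the data and parameter grid are made up for the example):

```python
# Hedged illustration of "no additional processing": a SciPy sparse matrix,
# like one built from EAV data, feeds scikit-learn directly, including
# hyper-parameter search.
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X = sparse_random(200, 50, density=0.05, random_state=0, format="csr")
y = np.asarray(X.sum(axis=1)).ravel() + np.random.default_rng(0).normal(size=200)

search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=3)
search.fit(X, y)  # sparse input accepted as-is, no densification needed
print(search.best_params_)
```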
To prevent ethical issues, we will follow Google Cloud Responsible AI practices [29].
Optimax AIDB User tables provide hierarchical security to maintain copyright and intellectual property rights, allowing fine-grained access control down to each case record. We achieve this by keeping the user ID and a few other user-related properties directly within the Case tables. When someone uploads data and creates case records, they assign case-level permissions: private, shared with all ODF users, shared only with premium ODF users, public, etc. Additional access control tables keep the associated permission levels for each entity if the data owner requires it. Furthermore, a special set of tables allows access permissions to be granted to individual users or user groups.
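A sketch of how those permission levels could be resolved (the level names, the "premium" flag, and the group-grant check are assumptions based on the description above):

```python
# A sketch (not the actual ODF logic) of resolving the case-level permission
# levels described above; "premium" status and group grants are assumptions.
from enum import Enum


class Sharing(Enum):
    PRIVATE = 0
    PREMIUM_ONLY = 1
    ALL_USERS = 2
    PUBLIC = 3


def can_read(case_sharing: Sharing, *, is_owner: bool, is_authenticated: bool,
             is_premium: bool, has_group_grant: bool) -> bool:
    if is_owner or has_group_grant:       # explicit grants always win
        return True
    if case_sharing is Sharing.PUBLIC:
        return True
    if case_sharing is Sharing.ALL_USERS:
        return is_authenticated
    if case_sharing is Sharing.PREMIUM_ONLY:
        return is_premium
    return False                          # PRIVATE: owner and grants only
```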
A custom Python/Django website backend API (GCP: Cloud Run [30], in a Docker container) manages data access. Data is stored in a PostgreSQL database (GCP: Cloud SQL [31]) encrypted on the volume. Users only have access to the data and actions that the organizational admin permits. (We can implement 2FA as an optional organizational feature.) The API allows an authenticated user with proper authorization/permissions to access the big data (GCP: BigQuery), creating an extra layer of separation for better security of large datasets. More information on security can be found in the official Django documentation [32].
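A hedged sketch of that separation layer, combining Django's authentication decorators with the BigQuery client (the view name, permission string, and query are illustrative):

```python
# Sketch of the extra separation layer: the Django API authenticates and
# authorizes the user, and only then queries BigQuery on their behalf.
# View name, permission string, and SQL are illustrative assumptions.
from django.contrib.auth.decorators import login_required, permission_required
from django.http import JsonResponse
from google.cloud import bigquery


@login_required
@permission_required("aidb.view_datapoint", raise_exception=True)
def case_report(request):
    # Credentials come from the environment (e.g., the Cloud Run service
    # account); end users never hold direct BigQuery access.
    client = bigquery.Client()
    rows = client.query(
        "SELECT attribute, value FROM aidb.datapoints LIMIT 100"
    ).result()
    return JsonResponse({"rows": [dict(r.items()) for r in rows]})
```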