DATA LEAKAGE PROBLEM IN MACHINE LEARNING

Conference section: Section 14. Technical Sciences
Article DOI: 10.32743/SpainConf.2022.3.17.336114
Bibliographic description:
Makarenkova V.M., Pavlova K.A. DATA LEAKAGE PROBLEM IN MACHINE LEARNING // Proceedings of the XVII International Multidisciplinary Conference «Prospects and Key Tendencies of Science in Contemporary World». Bubok Publishing S.L., Madrid, Spain. 2022. DOI: 10.32743/SpainConf.2022.3.17.336114


Ksenia Pavlova

Student, Moscow Aviation Institute (NRU),

 Russia, Moscow

Vera Makarenkova

Student, Moscow Aviation Institute (NRU),

 Russia, Moscow

 

ABSTRACT

This article discusses the impact of data leakage on the quality of machine learning models, analyses the problem, demonstrates the results of a classifier under different types of leaks, and presents potential solutions.

 

Keywords: machine learning, data leakage, classification.

 

Introduction

Machine learning is a branch of artificial intelligence. Using computation, we design systems that learn from data in the course of being trained [1]. Machine learning allows the automation of human mental and physical labour and significantly reduces the time needed to complete tasks. For this reason, machine learning is now used everywhere.

Most scientific articles describe the potentially limitless possibilities of machine learning (ML), but little is said about its problems and shortcomings. Yet there are many cases where ML models do not work as expected.

Data leakage in machine learning is a model failure caused by problems with the data — its composition, ordering, or uniqueness — that allow information unavailable at prediction time to reach the model during training.

Data leakage causes

The most common cause of data leakage is an error in data preprocessing: the inclusion of test data in the training data, or the inclusion of the target variable (which the model tries to predict) in the data on which the prediction of that target variable is based.
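The correct order of operations can be sketched as follows (a minimal illustration in Python, assuming scikit-learn is available; the data here are synthetic, not the Titanic dataset): the split happens before any preprocessing or fitting, so no test rows can reach the training step.

```python
# Illustrative sketch: split FIRST, then preprocess and fit, so neither
# test rows nor target information can leak into training.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

# The split produces disjoint index sets before any statistics are computed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (80, 3) (20, 3)
```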

Another common cause is the duplication of data. For example, if a dataset contains user messages from some platform, duplicates can occur because spammers repeatedly send the same message; identical records may then end up in both the training and test sets.
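Such duplicates can be removed before splitting; a minimal sketch with pandas (the column names are illustrative, not from any real platform):

```python
# Sketch: drop duplicate messages before splitting the data, so the
# same record cannot appear in both train and test.
import pandas as pd

msgs = pd.DataFrame({
    "user": ["a", "b", "b", "c"],
    "text": ["hi", "buy now!", "buy now!", "hello"],
})

deduped = msgs.drop_duplicates(subset=["user", "text"])
print(len(deduped))  # 3 — one spam duplicate dropped
```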

In addition, there is so-called implicit leakage. Consider, for example, time series — data that contain a time component. Here the error arises from the way the training and test sets are composed: observations from after the prediction point end up in the training set.

In some cases, identifying leaks is not difficult. For example, if the model performs so well that the result looks implausible, there is most likely a data leak: the model may not be learning at all, but simply exploiting a direct relationship between a feature and the target.

Another way to identify leaks is exploratory data analysis, which uses statistical and visualization tools. This way, you can find features that are suspiciously highly correlated with the target variable.
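A minimal sketch of such a check (synthetic data; the column names are hypothetical): compute each feature's correlation with the target and flag values close to 1.

```python
# Sketch of a simple leak check: a feature that is almost perfectly
# correlated with the target is a red flag for leakage.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({"age": rng.normal(40, 10, 200)})
df["target"] = (df["age"] > 40).astype(int)
df["leaky"] = df["target"]          # a feature that encodes the answer

corr = df.corr()["target"].drop("target")
print(corr["leaky"])                # ≈ 1.0 — a red flag
```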

Data description

Kaggle provides a dataset containing information about the passengers of the Titanic, which sank in 1912. The target variable is the passenger's survival status: 0 if the passenger died and 1 if the passenger survived. The data contain information about the age, sex, ticket, and class of passengers, along with other attributes. This dataset is one of the most popular on Kaggle and is used in various competitions. The data are well suited for binary classification and will help demonstrate different types of data leaks.

Model description

Logistic regression is one of the most popular classification algorithms; it estimates the probability of belonging to a class using the logistic curve. The goal of a logistic regression model is to explain a binary or proportional response (dependent variable) in terms of one or more predictors [3]. The algorithm is implemented in many data analysis packages, including Scikit-learn.
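Fitting such a classifier with Scikit-learn can be sketched as follows (synthetic binary data stand in for the Titanic columns, which are not reproduced here):

```python
# Minimal sketch of scikit-learn's LogisticRegression on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)  # class probabilities from the logistic curve
print(clf.score(X_te, y_te))     # accuracy on the held-out test set
```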

Results

To examine the problem of data leakage, the original dataset was pre-processed before running the model. The comparative analysis is based on accuracy, F1 score, and ROC-AUC score.

Let's compare the original dataset with modifications for different data leaks:

  1. Put 25%, 50%, and 90% of the test data into the training data, thereby forming data leaks of different scales.
  2. Add their duplicates (about 70%) to the original training data.
  3. Add to the original data a parameter that points to the target variable.

In all datasets, incorrect feature values were replaced, missing data were minimized, and features that could not affect the training results were removed. To train a linear model, all attributes must be converted into numerical values and scaled; this is done using OneHotEncoder and StandardScaler. The target variable is removed from the training data and used for comparing the results with the reference.
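The preprocessing described above can be sketched as follows (the column names mimic the Titanic data but the values are illustrative): categorical columns are one-hot encoded, numeric columns are standardized.

```python
# Sketch of the described preprocessing: OneHotEncoder for categorical
# features, StandardScaler for numeric ones, combined per column.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "age": [22.0, 38.0, 26.0, 35.0],
})

pre = ColumnTransformer([
    ("cat", OneHotEncoder(), ["sex"]),      # 2 one-hot columns
    ("num", StandardScaler(), ["age"]),     # 1 scaled column
])
X = pre.fit_transform(df)
print(X.shape)  # (4, 3)
```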

The table (table 1) shows metrics calculated on test data after training with standard logistic regression parameters.

Table 1.
Metrics for each leak

Data type / Metrics                   Accuracy   F1    ROC-AUC
Original dataset                        83%      79%     82%
25% of the test data in the train       77%      62%     72%
50% of the test data in the train       78%      67%     75%
90% of the test data in the train       90%      82%     86%
Data duplication (70%)                  91%      86%     89%
“Target feature” in the train data      91%      84%     88%

 

According to the table (table 1), as the share of test data mixed into the training data grows, the quality of the model first goes down and then the metrics begin to increase. This can be explained by the model ceasing to catch patterns in the data, losing its generalization ability, and eventually overfitting — simply memorizing individual samples. A similar situation occurs with the other types of leaks: the metrics are higher than on the reference dataset even though the learning algorithm is the same for all datasets, so the results of the experiments with duplicated data and the target variable are unrealistic; this likewise indicates overfitting and the unsuitability of the model for predictions on new data. It is noticeable that data leaks can not only degrade the model's ability to learn but also make it potentially useless: the metrics approach 100%, yet on other data the model behaves unpredictably, because it no longer performs the main task of machine learning — the independent search for patterns in the data. To avoid such situations, several preventive measures can be taken.

Solution methods

The first step to minimize data leakage is to make sure that the features do not correlate suspiciously with the target variable and do not contain the information the model is supposed to predict. It is also important to keep the training, validation, and test sets strictly separate.

Data normalization is a common practice; with standard scaling, for example, the mean is subtracted and the values are divided by the standard deviation. Applying normalization to the overall dataset lets information from the test set affect the training set, which ultimately leads to data leakage. Thus, any normalization should be fitted on the training subset and only applied to the test subset. It is also a good idea to split the dataset into three groups by adding a validation set in addition to the training and test sets. The validation set allows fine-tuning of the model parameters before the model is tested on unseen data. After splitting the data into these groups, any exploratory data analysis (EDA) should be performed only on the training set.
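A minimal sketch of normalization applied in isolation (assuming scikit-learn): the scaler's statistics come from the training subset only, and the test subset is transformed with those same statistics.

```python
# Sketch: fit the scaler on the training data only, then reuse its
# statistics on the test data, so no test information leaks back.
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[10.0]])

scaler = StandardScaler().fit(X_train)    # statistics from train only
X_test_scaled = scaler.transform(X_test)  # test transformed, not refitted
print(scaler.mean_)  # [2.5]
```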

When working with time series data, a time cut-off is very useful, because it prevents information from after the prediction point from entering the training data.
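One way to enforce such a cut-off (a sketch using scikit-learn's TimeSeriesSplit) is an ordered split where every training fold strictly precedes its test fold:

```python
# Sketch: TimeSeriesSplit never places future observations in training.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # 10 time-ordered observations
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    # No test index is earlier than any training index.
    assert train_idx.max() < test_idx.min()
    print(train_idx, test_idx)
```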

Cross-validation also helps minimize data leakage. The original dataset is partitioned into several parts, known as folds. The model is trained on all but one of the folds and evaluated on the held-out fold; the process is repeated until each fold has been used for testing.
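To keep preprocessing leak-free inside cross-validation, it can be wrapped together with the model in a pipeline (a sketch assuming scikit-learn; synthetic data): the scaler is then refitted within each fold, so fold-level test data never influences scaling.

```python
# Sketch: a Pipeline makes cross_val_score refit the scaler per fold,
# preventing leakage from fold test data into the scaling statistics.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
print(len(scores))  # 5 — one score per fold
```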

Conclusion

Data leakage is a pervasive issue in machine learning. A model is trained on known data and is expected to work on previously unseen data; the better it generalizes, the higher its quality scores. A data leak prevents the model from generalizing properly and therefore leads to false conclusions about its ability to perform. To obtain a reliable and generalizable model, special attention must be paid to identifying and preventing data leakage.

 

References:

  1. Bell J. Machine Learning: Hands-On for Developers and Technical Professionals. Indianapolis, Indiana: John Wiley & Sons, Inc., 2015. P. 2.
  2. Bengio Y. // Foundations and Trends in Machine Learning. Vol. 2. Boston: Now Publishers, 2009. Pp. 1–127.
  3. Hilbe J.M. Logistic Regression Models. Taylor & Francis Group, LLC, 2009. P. 15.
  4. Logistic Regression with Scikit-learn [Online]: Linear model implementation, 2022. Available at: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html (accessed 25.03.2022).