Data Manipulation

  • Generally, the data was very clean. There were no cells with NaN values.
  • Most features, such as the players' vision, jumping, and strength, were scored between 0 and 100. The csv file also contained some categorical string data, such as the player's name and club. Once we decided to study the relation between player attributes and the finishing score, we no longer needed the player's name, club, or nationality, so we eliminated almost all of these features from the data.
  • Our initial dataset had almost 70 columns and 18000 rows. Before applying any Machine Learning algorithms, we eliminated variables that were highly correlated with each other. In the end we were left with 12 features and 1 label (finishing).

Figure: Correlation values between the attributes (pairwise correlations)
Figure: Heatmap of the features

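The correlation-based elimination described above can be sketched roughly as follows. This is only an illustration, not the procedure we actually ran: the 0.9 threshold, the helper names `pearson` and `drop_correlated`, and the tiny demo table are all assumptions made for the example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (assumes neither is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def drop_correlated(columns, label, threshold=0.9):
    """Return the feature names to keep.

    A feature is kept only if its absolute correlation with every
    already-kept feature is at most `threshold` (an assumed cutoff).
    """
    kept = []
    for name in columns:
        if name == label:
            continue  # the label is never a candidate for removal
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Synthetic demo data; the values are made up for illustration.
demo = {
    "vision":    [10, 20, 30, 40, 50],
    "passing":   [11, 21, 31, 41, 51],  # moves in lockstep with vision
    "strength":  [50, 10, 40, 20, 30],
    "finishing": [15, 35, 30, 60, 55],  # the label
}
kept = drop_correlated(demo, label="finishing")
print(kept)
```

In this demo, `passing` is removed because it is perfectly correlated with `vision`, while `strength` is kept.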
  • In addition, since all of these variables already take values between 0 and 100, we did not rescale them.
  • Different ML techniques sometimes required different versions of the dataset. For example, since Logistic Regression expects a binary target variable (0 or 1), we replaced the finishing column with a “forvet” column, constructed by mapping ranges of the finishing column to 0 or 1. If the “finishing” value is greater than 70, forvet is 1. If it is less than 70, forvet is 0.
  • For Decision Tree and Neural Networks, we divided the range into three classes: 0-60 got a “forvet” value of 0, 60-80 got 1, and 80-100 got 2.
  • Also, since we had a very big dataset, for the sake of simplicity and understandability we created two new datasets, which were extracted from the initial dataset.
  • Our datasets:
  1. Version #1: contains only 300 rows, obtained by shuffling the initial dataset, and 13 columns. It has the forvet column instead of the finishing column. Download: fifa_300
  2. Version #2: contains 300 rows and 5 features. Again, it has the forvet column instead of finishing.
  3. Version #3: contains all the rows of the raw dataset (17994) but only 13 columns. Again, it has the forvet column instead of finishing. Download: fifa_allrows
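
As a rough sketch, the two “forvet” encodings could be computed like this. The function names are hypothetical, and the handling of values that fall exactly on a boundary (70 for the binary case, 60 and 80 for the three-class case) is an assumption, since the description above does not pin it down.

```python
def binary_forvet(finishing, cutoff=70):
    """Binary label for Logistic Regression: 1 above the cutoff, 0 otherwise.

    Assumption: a value exactly equal to the cutoff is labelled 0.
    """
    return 1 if finishing > cutoff else 0

def three_class_forvet(finishing):
    """Three-class label for Decision Tree and Neural Networks.

    Assumption: boundary values go to the higher class (60 -> 1, 80 -> 2).
    """
    if finishing < 60:
        return 0
    if finishing < 80:
        return 1
    return 2

# Made-up finishing scores for illustration.
scores = [25, 65, 72, 91]
print([binary_forvet(s) for s in scores])
print([three_class_forvet(s) for s in scores])
```

For the sample scores, the binary labels are 0, 0, 1, 1 and the three-class labels are 0, 1, 1, 2.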

Because we did not use linear regression for classification, we ran it on the dataset that keeps the “finishing” column. You can download it here: fifa_finish

While experimenting with ML techniques, we wanted to see the effect of data size, which is why we have several versions of the data.
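
A smaller version of the data, such as the 300-row ones, could be produced along these lines. This is a minimal sketch under assumptions: the helper names, the fixed seed, the column names, and the synthetic rows are all invented for the example and are not our actual files.

```python
import random

def sample_rows(rows, n, seed=42):
    """Shuffle a copy of the rows and keep the first n."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the subset is reproducible
    shuffled = list(rows)      # copy, so the full dataset is left untouched
    rng.shuffle(shuffled)
    return shuffled[:n]

def keep_columns(rows, names):
    """Project each row (a dict) onto the given column names."""
    return [{name: row[name] for name in names} for row in rows]

# Synthetic stand-in for the full dataset: 17994 rows, hypothetical columns.
full = [
    {"player_id": i, "vision": i % 100, "strength": (i * 7) % 100, "forvet": i % 2}
    for i in range(17994)
]

small = sample_rows(full, 300)                       # fewer rows, same columns
narrow = keep_columns(small, ["vision", "forvet"])   # fewer rows and fewer columns
print(len(full), len(small), sorted(narrow[0]))
```

Training the same model on `full`, `small`, and `narrow` is one way to compare how row count and feature count affect the results.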