Hitters Salary Prediction with Linear Regression
A Prediction Study

ANIL NEBİ ŞENTÜRK
Feb 1, 2022 · 11 min read
Image source: https://www.trtspor.com.tr/haber/diger-sporlar/olimpiyat-oyunlari/beyzbol-ve-softbolun-umudu-olimpiyat-99067.html

Abstract

In this study, salaries were estimated with the linear regression technique (and related regressors) from a data set containing the 1986 salaries and career statistics of baseball players. Several models were built within a common framework and their performance was examined. With the model obtained at the end of the study, a player's salary could be estimated from a given set of parameters with a success rate of about 75%. Before modeling, the data set has to be interpreted correctly and prepared; the steps followed are, in order:
Exploratory Data Analysis
Correlation Analysis
Feature Engineering
Feature Extraction (Variable Derivation)
Encoding
Model Building
Estimation
Estimation Achievement Test

Data Set
This dataset was originally taken from the StatLib library at Carnegie Mellon University. It is part of the data used in the 1988 ASA Graphics Section Poster Session. The salary data originally come from Sports Illustrated, April 20, 1987; the 1986 and career statistics come from the 1987 Baseball Encyclopedia Update, published by Collier Books, Macmillan Publishing Company, New York. The data set is currently shared on Kaggle.
It consists of 20 variables and 322 observations (6,440 values in total) and is 20.91 KB in size.
The variables in the data set are as follows:
AtBat: Number of times at bat in the 1986–1987 season
Hits: Number of hits in the 1986–1987 season
HmRun: Number of home runs in the 1986–1987 season
Runs: Number of runs the player scored for his team in the 1986–1987 season
RBI: Number of runs batted in by the player in the 1986–1987 season
Walks: Number of walks (bases on balls) drawn by the player in the 1986–1987 season
Years: Player’s playing time in the major leagues (years)
CAtBat: Number of times at bat during the player’s career
CHits: Number of hits during the player’s career
CHmRun: Number of home runs during the player’s career
CRuns: Number of runs scored during the player’s career
CRBI: Number of runs batted in during the player’s career
CWalks: Number of walks drawn during the player’s career
League: A factor with levels A and N indicating the player’s league at the end of 1986
Division: A factor with levels E and W indicating the player’s division at the end of 1986
PutOuts: Number of putouts made by the player in the 1986–1987 season
Assists: Number of assists made by the player in the 1986–1987 season
Errors: Number of errors made by the player in the 1986–1987 season
Salary: The player’s 1986–1987 salary (in thousands of dollars)
NewLeague: A factor with levels A and N indicating the player’s league at the start of the 1987 season

# Libraries and settings

import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, VotingRegressor
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import GridSearchCV, cross_val_score, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, RobustScaler, StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# local helper modules with the data preparation and EDA utilities used throughout the post
from helpers.data_prep import *
from helpers.eda import *

warnings.simplefilter(action='ignore', category=Warning)
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter("ignore", category=ConvergenceWarning)

# pandas display options: show all rows/columns and format floats with 3 decimals
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', None)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

Reading the Data and Exploratory Data Analysis

df = pd.read_csv("C:/Users/Lenovo/hitters.csv")
df.head()
df.shape
# 322 observations, 20 variables

General information about the data set, obtained with the check_df(df) helper function.
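check_df comes from the imported helper modules and is not shown in the post; a minimal sketch of what such a helper typically prints (the actual helpers.eda version may differ) looks like this:

def check_df(dataframe, head=5):
    # quick overview: dimensions, dtypes, first/last rows, missing values and quantiles
    print("##### Shape #####")
    print(dataframe.shape)
    print("##### Types #####")
    print(dataframe.dtypes)
    print("##### Head #####")
    print(dataframe.head(head))
    print("##### Tail #####")
    print(dataframe.tail(head))
    print("##### NA #####")
    print(dataframe.isnull().sum())
    print("##### Quantiles #####")
    print(dataframe.describe([0.05, 0.50, 0.95, 0.99]).T)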

cat_cols,  num_cols, num_but_cat = grab_col_names(df)
num_cols
cat_cols
Observations: 322
Variables: 20
cat_cols: 3
num_cols: 17
cat_but_car: 0
num_but_cat: 0
cat_cols: ['League', 'Division', 'NewLeague']
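grab_col_names is also a helper; a rough sketch of the idea, assuming the usual dtype and cardinality thresholds (the actual implementation may differ):

def grab_col_names(dataframe, cat_th=10, car_th=20):
    # categorical columns: object dtype plus numeric columns with very few unique values
    cat_cols = [col for col in dataframe.columns if dataframe[col].dtypes == "O"]
    num_but_cat = [col for col in dataframe.columns
                   if dataframe[col].dtypes != "O" and dataframe[col].nunique() < cat_th]
    cat_but_car = [col for col in dataframe.columns
                   if dataframe[col].dtypes == "O" and dataframe[col].nunique() > car_th]
    cat_cols = [col for col in cat_cols + num_but_cat if col not in cat_but_car]
    # numerical columns: everything numeric that was not re-classified as categorical
    num_cols = [col for col in dataframe.columns
                if dataframe[col].dtypes != "O" and col not in num_but_cat]
    return cat_cols, num_cols, num_but_cat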
for col in cat_cols:
    cat_summary(df, col)

for col in num_cols:
    num_summary(df, col)

num_summary output for the numerical variables (count, mean, std and selected quantiles):
AtBat: count 322, mean 380.929, std 153.405, min 16, 5% 161, 10% 194.1, 20% 221.8, 30% 281.3, 40% 321.8, 50% 379.5, 60% 430.2, 70% 490, 80% 535.6, 90% 584.9, 95% 609.9, 99% 658.59, max 687
Hits: count 322, mean 101.025, std 46.455, min 1, 5% 39, 10% 45.1, 20% 57, 30% 70, 40% 82, 50% 96, 60% 113, 70% 128.7, 80% 144, 90% 163, 95% 174, 99% 210.79, max 238
HmRun: count 322, mean 10.770, std 8.709, min 0, 5% 0, 10% 1, 20% 3, 30% 5, 40% 6, 50% 8, 60% 10.6, 70% 14, 80% 18, 90% 24, 95% 28.95, 99% 33, max 40
Runs: count 322, mean 50.910, std 26.024, min 0, 5% 17, 10% 20.1, 20% 27, 30% 33, 40% 41, 50% 48, 60% 55.6, 70% 66, 80% 75.8, 90% 89, 95% 94, 99% 107.79, max 130
RBI: count 322, mean 48.028, std 26.167, min 0, 5% 14, 10% 19, 20% 26, 30% 30, 40% 36.4, 50% 44, 60% 49, 70% 58, 80% 73, 90% 86, 95% 96, 99% 112.37, max 121
Walks: count 322, mean 38.742, std 21.639, min 0, 5% 11.05, 10% 15, 20% 20, 30% 24, 40% 30, 50% 35, 60% 40, 70% 48.7, 80% 59, 90% 69.9, 95% 78, 99% 93.58, max 105
Years: count 322, mean 7.444, std 4.926, min 1, 5% 1, 10% 2, 20% 3, 30% 4, 40% 5, 50% 6, 60% 7.6, 70% 10, 80% 12, 90% 15, 95% 17, 99% 19.79, max 24
CAtBat: count 322, mean 2648.683, std 2324.206, min 19, 5% 216.5, 10% 342.2, 20% 667.6, 30% 963.4, 40% 1402.2, 50% 1928, 60% 2654, 70% 3365, 80% 4483, 90% 6123.4, 95% 7125.6, 99% 8749.97, max 14053
CHits: count 322, mean 717.571, std 654.473, min 4, 5% 54, 10% 86, 20% 164.6, 30% 238, 40% 355.8, 50% 508, 60% 708.2, 70% 911.1, 80% 1229.2, 90% 1659.6, 95% 2017.15, 99% 2500.34, max 4256
CHmRun: count 322, mean 69.491, std 86.266, min 0, 5% 2, 10% 4, 20% 10.2, 30% 16.3, 40% 27.4, 50% 37.5, 60% 51.2, 70% 72.4, 80% 103.8, 90% 194.9, 95% 265.65, 99% 373.12, max 548
CRuns: count 322, mean 358.795, std 334.106, min 1, 5% 27, 10% 38.1, 20% 80.4, 30% 117.3, 40% 181, 50% 247, 60% 333, 70% 443.4, 80% 627.2, 90% 895.7, 95% 1032.3, 99% 1174.37, max 2165
CRBI: count 322, mean 330.118, std 333.220, min 0, 5% 22.05, 10% 34.1, 20% 70.2, 30% 106.3, 40% 149, 50% 220.5, 60% 303.6, 70% 379.1, 80% 496.2, 90% 861.6, 95% 1071.75, 99% 1310.85, max 1659
CWalks: count 322, mean 260.239, std 267.058, min 0, 5% 15.05, 10% 30.1, 20% 55, 30% 82, 40% 117.4, 50% 170.5, 60% 226.6, 70% 300.7, 80% 421, 90% 643.9, 95% 817.6, 99% 1139.14, max 1566
PutOuts: count 322, mean 288.938, std 280.705, min 0, 5% 33.2, 10% 62.2, 20% 99.6, 30% 132, 40% 167, 50% 212, 60% 257.4, 70% 303, 80% 365.8, 90% 685.6, 95% 910.65, 99% 1301.19, max 1378
Assists: count 322, mean 106.913, std 136.855, min 0, 5% 0, 10% 2, 20% 5.2, 30% 9, 40% 15.4, 50% 39.5, 60% 76.8, 70% 134.8, 80% 210.8, 90% 353.3, 95% 431.45, 99% 478.16, max 492
Errors: count 322, mean 8.040, std 6.368, min 0, 5% 0, 10% 1, 20% 3, 30% 4, 40% 5, 50% 6, 60% 8, 70% 10, 80% 13, 90% 17, 95% 20, 99% 25.79, max 32
Salary: count 263, mean 535.926, std 451.119, min 67.5, 5% 86.6, 10% 100, 20% 155, 30% 221, 40% 300, 50% 425, 60% 538, 70% 700, 80% 809, 90% 1048.667, 95% 1346, 99% 2032.887, max 2460

def target_summary_with_num(dataframe, target, numerical_col):
    print(dataframe.groupby(target).agg({numerical_col: "mean"}), end="\n\n\n")

for col in num_cols:
    target_summary_with_num(df, "Salary", col)

for col in cat_cols:
    target_summary_with_cat(df, "Salary", col)

missing_values_table(df)

        n_miss  ratio
Salary      59 18.320
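missing_values_table is another imported helper; a plausible minimal version (an assumption here, the helpers.data_prep implementation may differ):

def missing_values_table(dataframe, na_name=False):
    # columns that contain at least one missing value
    na_columns = [col for col in dataframe.columns if dataframe[col].isnull().sum() > 0]
    n_miss = dataframe[na_columns].isnull().sum().sort_values(ascending=False)
    ratio = (dataframe[na_columns].isnull().sum() / dataframe.shape[0] * 100).sort_values(ascending=False)
    missing_df = pd.concat([n_miss, np.round(ratio, 2)], axis=1, keys=['n_miss', 'ratio'])
    print(missing_df, end="\n")
    if na_name:
        return na_columns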

Let’s drop the observations with missing Salary values (Salary is the only variable with missing data):
df = df.dropna()
df.isnull().values.any()
df.shape

Our number of observations decreased from 322 to 263 when we discarded the missing values.

Let’s look at the variables with outliers.

# IQR-style thresholds, here based on the 10% and 90% quantiles
def outlier_thresholds(dataframe, variable, low_quantile=0.10, up_quantile=0.90):
    quantile_one = dataframe[variable].quantile(low_quantile)
    quantile_three = dataframe[variable].quantile(up_quantile)
    interquantile_range = quantile_three - quantile_one
    up_limit = quantile_three + 1.5 * interquantile_range
    low_limit = quantile_one - 1.5 * interquantile_range
    return low_limit, up_limit


def check_outlier(dataframe, col_name):
    low_limit, up_limit = outlier_thresholds(dataframe, col_name)
    if dataframe[(dataframe[col_name] > up_limit) | (dataframe[col_name] < low_limit)].any(axis=None):
        return True
    else:
        return False


for col in num_cols:
    if col != "Salary":
        print(col, check_outlier(df, col))

ACCESSING THE OUTLIER VALUES

def grab_outliers(dataframe, col_name, index=False):
    low, up = outlier_thresholds(dataframe, col_name)
    # print only a preview if there are many outliers, otherwise print them all
    if dataframe[((dataframe[col_name] < low) | (dataframe[col_name] > up))].shape[0] > 10:
        print(dataframe[((dataframe[col_name] < low) | (dataframe[col_name] > up))].head())
    else:
        print(dataframe[((dataframe[col_name] < low) | (dataframe[col_name] > up))])

    if index:
        outlier_index = dataframe[((dataframe[col_name] < low) | (dataframe[col_name] > up))].index
        return outlier_index

grab_outliers(df, "CHits")
grab_outliers(df, "CHmRun")
grab_outliers(df, "CRuns")
grab_outliers(df, "CWalks")

The outliers turn out to be the observations with index 236 and 249; let’s delete them.
df = df.drop(index=[236, 249])
df = df.reset_index(drop=True)

df.shape
The observation count, previously 263, drops to 261 because we accessed the outlier rows directly, dropped them and reset the index.
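As an aside, instead of hard-coding the indices 236 and 249, the same rows could be collected programmatically with the index=True option of grab_outliers (a sketch, not part of the original script):

outlier_idx = set()
for col in ["CHits", "CHmRun", "CRuns", "CWalks"]:
    # grab_outliers returns the index of the outlier rows when index=True
    outlier_idx |= set(grab_outliers(df, col, index=True))

df = df.drop(index=list(outlier_idx)).reset_index(drop=True)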

When we examine the quantiles, there is skewness in CWalks, PutOuts and Salary.
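A quick way to quantify that skewness (a small addition, not in the original script) is pandas' skew(), where strongly positive values indicate a long right tail:

# positive skew values confirm the long right tails of these variables
df[["CWalks", "PutOuts", "Salary"]].skew()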

# switch the thresholds to the classic IQR rule based on the 25% and 75% quartiles
def outlier_thresholds(dataframe, col_name, q1=0.25, q3=0.75):
    quartile1 = dataframe[col_name].quantile(q1)
    quartile3 = dataframe[col_name].quantile(q3)
    interquantile_range = quartile3 - quartile1
    up_limit = quartile3 + 1.5 * interquantile_range
    low_limit = quartile1 - 1.5 * interquantile_range
    return low_limit, up_limit

bd_low, bd_up = outlier_thresholds(df, "Salary")
df[(df["Salary"] < bd_low ) | (df["Salary"] > bd_up)].shape

def remove_outlier(dataframe, col_name):
    low_limit, up_limit = outlier_thresholds(dataframe, col_name)
    df_without_outliers = dataframe[~((dataframe[col_name] < low_limit) | (dataframe[col_name] > up_limit))]
    return df_without_outliers

df = remove_outlier(df, "Salary")

df.shape

When we deleted all the Salary outliers, the observation count became 250. Now let’s cap the three variables that look skewed based on their quantiles.

def replace_with_thresholds(dataframe, variable):
    low_limit, up_limit = outlier_thresholds(dataframe, variable)
    dataframe.loc[(dataframe[variable] < low_limit), variable] = low_limit
    dataframe.loc[(dataframe[variable] > up_limit), variable] = up_limit

replace_with_thresholds(df, "PutOuts")
replace_with_thresholds(df, "CWalks")
replace_with_thresholds(df, "CRBI")

check_outlier(df, "PutOuts")
check_outlier(df, "CWalks")
check_outlier(df, "CRBI")

All three checks now return False, since the outliers in these variables have been capped at the thresholds.

# BINARY ENCODING

binary_cols = [col for col in df.columns if df[col].dtype not in ["int64", "float64"]
               and df[col].nunique() == 2]

# label encoding
def label_encoder(dataframe, binary_col):
    labelencoder = LabelEncoder()
    dataframe[binary_col] = labelencoder.fit_transform(dataframe[binary_col])
    return dataframe

for col in binary_cols:
    label_encoder(df, col)

df.head()
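LabelEncoder assigns codes in alphabetical order, so League and NewLeague become A=0/N=1 and Division becomes E=0/W=1; a quick check (added here for illustration):

# the three former object columns are now 0/1 integers
df[["League", "Division", "NewLeague"]].head()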
def high_correlated_cols(dataframe, plot=False, corr_th=0.70):
    corr = dataframe.corr()
    cor_matrix = corr.abs()
    # keep only the upper triangle so each pair of variables is considered once
    upper_triangle_matrix = cor_matrix.where(np.triu(np.ones(cor_matrix.shape), k=1).astype(bool))
    drop_list = [col for col in upper_triangle_matrix.columns if any(upper_triangle_matrix[col] > corr_th)]
    if plot:
        sns.set(rc={'figure.figsize': (15, 15)})
        sns.heatmap(corr, cmap="RdBu")
        plt.show()
    return drop_list

high_correlated_cols(df, plot=True)

def one_hot_encoder(dataframe, categorical_cols, drop_first=False):
    dataframe = pd.get_dummies(dataframe, columns=categorical_cols, drop_first=drop_first)
    return dataframe

df = one_hot_encoder(df, cat_cols, drop_first=True)
check_df(df)
y = df["Salary"]
X = df.drop(["Salary"], axis=1)
## BASE MODEL


models = [('LR', LinearRegression()),
          ("Ridge", Ridge()),
          ("Lasso", Lasso()),
          ("ElasticNet", ElasticNet()),
          ('KNN', KNeighborsRegressor()),
          ('CART', DecisionTreeRegressor()),
          ('RF', RandomForestRegressor()),
          ('SVR', SVR()),
          ('GBM', GradientBoostingRegressor()),
          ("XGBoost", XGBRegressor(objective='reg:squarederror')),
          ("LightGBM", LGBMRegressor()),
          ("CatBoost", CatBoostRegressor(verbose=False))]

for name, regressor in models:
    rmse = np.mean(np.sqrt(-cross_val_score(regressor, X, y, cv=10, scoring="neg_mean_squared_error")))
    print(f"RMSE: {round(rmse, 4)} ({name}) ")




# hyperparameter optimization
cart_params = {'max_depth': range(1, 20),
               "min_samples_split": range(2, 30)}

rf_params = {"max_depth": [5, 8, 15, None],
             "max_features": [5, 7, "auto"],
             "min_samples_split": [8, 15, 20],
             "n_estimators": [200, 500, 1000]}

xgboost_params = {"learning_rate": [0.1, 0.01, 0.01],
                  "max_depth": [5, 8, 12, 20],
                  "n_estimators": [100, 200, 300, 500],
                  "colsample_bytree": [0.5, 0.8, 1]}

lightgbm_params = {"learning_rate": [0.01, 0.1, 0.001],
                   "n_estimators": [300, 500, 1500],
                   "colsample_bytree": [0.5, 0.7, 1]}

regressors = [("CART", DecisionTreeRegressor(), cart_params),
              ("RF", RandomForestRegressor(), rf_params),
              ('XGBoost', XGBRegressor(objective='reg:squarederror'), xgboost_params),
              ('LightGBM', LGBMRegressor(), lightgbm_params)]

best_models = {}

for name, regressor, params in regressors:
    print(f"########## {name} ##########")
    rmse = np.mean(np.sqrt(-cross_val_score(regressor, X, y, cv=10, scoring="neg_mean_squared_error")))
    print(f"RMSE: {round(rmse, 4)} ({name}) ")

    gs_best = GridSearchCV(regressor, params, cv=3, n_jobs=-1, verbose=False).fit(X, y)

    final_model = regressor.set_params(**gs_best.best_params_)
    rmse = np.mean(np.sqrt(-cross_val_score(final_model, X, y, cv=10, scoring="neg_mean_squared_error")))
    print(f"RMSE (After): {round(rmse, 4)} ({name}) ")

    print(f"{name} best params: {gs_best.best_params_}", end="\n\n")

    best_models[name] = final_model
######################################################
# Stacking & Ensemble Learning
######################################################

voting_reg = VotingRegressor(estimators=[('RF', best_models["RF"]),
                                         ('LightGBM', best_models["LightGBM"])])

voting_reg.fit(X, y)


np.mean(np.sqrt(-cross_val_score(voting_reg, X, y, cv=10, scoring="neg_mean_squared_error")))

######################################################
# Prediction for a New Observation
######################################################

X.columns
random_user = X.sample(1, random_state=45)
voting_reg.predict(random_user)
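Since the sampled row comes from the training data, the prediction can be compared with the player's actual salary (a small check added here):

# actual salary of the sampled player, for comparison with the prediction above
y.loc[random_user.index]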
# FEATURE EXTRACTION
df['NEW_HitRatio'] = df['Hits'] / df['AtBat']
df['NEW_RunRatio'] = df['HmRun'] / df['Runs']
df['NEW_CHitRatio'] = df['CHits'] / df['CAtBat']
df['NEW_CRunRatio'] = df['CHmRun'] / df['CRuns']
df['NEW_Avg_AtBat'] = df['CAtBat'] / df['Years']
df['NEW_Avg_Hits'] = df['CHits'] / df['Years']
df['NEW_Avg_HmRun'] = df['CHmRun'] / df['Years']
df['NEW_Avg_Runs'] = df['CRuns'] / df['Years']
df['NEW_Avg_RBI'] = df['CRBI'] / df['Years']
df['NEW_Avg_Walks'] = df['CWalks'] / df['Years']
df.loc[(df["Years"] <= 1), "NEW_YEARS_Cat"] = "Begginer"
df.loc[(df["Years"] > 1) & (df['Years'] <= 4), "NEW_YEARS_Cat"] = "JR"
df.loc[(df["Years"] > 4) & (df['Years'] <= 8), "NEW_YEARS_Cat"] = "Mid"
df.loc[(df["Years"] > 8) & (df['Years'] <= 12), "NEW_YEARS_Cat"] = "Senior"
df.loc[(df["Years"] > 12), "NEW_YEARS_Cat"] = "Expert"
df["NEW_Hit_Class"] = pd.qcut(df['Hits'], 4, labels=['poor','average','star','super_star'])

high_correlated_cols(df, plot=True)
## FINAL MODEL

# the categorical features derived above must be encoded before modelling
df = one_hot_encoder(df, ["NEW_YEARS_Cat", "NEW_Hit_Class"], drop_first=True)

# scale the original numeric columns; the target Salary is left on its own scale
num_cols.remove("Salary")
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])

y = df["Salary"]
X = df.drop(["Salary"], axis=1)

models = [('LR', LinearRegression()),
          ("Ridge", Ridge()),
          ("Lasso", Lasso()),
          ("ElasticNet", ElasticNet()),
          ('KNN', KNeighborsRegressor()),
          ('CART', DecisionTreeRegressor()),
          ('RF', RandomForestRegressor()),
          ('SVR', SVR()),
          ('GBM', GradientBoostingRegressor()),
          ("XGBoost", XGBRegressor(objective='reg:squarederror')),
          ("LightGBM", LGBMRegressor()),
          ("CatBoost", CatBoostRegressor(verbose=False))]

for name, regressor in models:
    rmse = np.mean(np.sqrt(-cross_val_score(regressor, X, y, cv=10, scoring="neg_mean_squared_error")))
    print(f"RMSE: {round(rmse, 4)} ({name}) ")



# hyperparameter optimization
cart_params = {'max_depth': range(1, 20),
               "min_samples_split": range(2, 30)}

rf_params = {"max_depth": [5, 8, 15, None],
             "max_features": [5, 7, "auto"],
             "min_samples_split": [8, 15, 20],
             "n_estimators": [200, 500, 1000]}

xgboost_params = {"learning_rate": [0.1, 0.01, 0.01],
                  "max_depth": [5, 8, 12, 20],
                  "n_estimators": [100, 200, 300, 500],
                  "colsample_bytree": [0.5, 0.8, 1]}

lightgbm_params = {"learning_rate": [0.01, 0.1, 0.001],
                   "n_estimators": [300, 500, 1500],
                   "colsample_bytree": [0.5, 0.7, 1]}

regressors = [("CART", DecisionTreeRegressor(), cart_params),
              ("RF", RandomForestRegressor(), rf_params),
              ('XGBoost', XGBRegressor(objective='reg:squarederror'), xgboost_params),
              ('LightGBM', LGBMRegressor(), lightgbm_params)]

best_models = {}

for name, regressor, params in regressors:
    print(f"########## {name} ##########")
    rmse = np.mean(np.sqrt(-cross_val_score(regressor, X, y, cv=10, scoring="neg_mean_squared_error")))
    print(f"RMSE: {round(rmse, 4)} ({name}) ")

    gs_best = GridSearchCV(regressor, params, cv=3, n_jobs=-1, verbose=False).fit(X, y)

    final_model = regressor.set_params(**gs_best.best_params_)
    rmse = np.mean(np.sqrt(-cross_val_score(final_model, X, y, cv=10, scoring="neg_mean_squared_error")))
    print(f"RMSE (After): {round(rmse, 4)} ({name}) ")

    print(f"{name} best params: {gs_best.best_params_}", end="\n\n")

    best_models[name] = final_model
######################################################
# Stacking & Ensemble Learning
######################################################

voting_reg = VotingRegressor(estimators=[('RF', best_models["RF"]),
                                         ('LightGBM', best_models["LightGBM"])])

voting_reg.fit(X, y)


np.mean(np.sqrt(-cross_val_score(voting_reg, X, y, cv=10, scoring="neg_mean_squared_error")))

######################################################
# Prediction for a New Observation
######################################################

X.columns
random_user = X.sample(1, random_state=45)
voting_reg.predict(random_user)
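The roughly 75% success rate mentioned in the abstract presumably refers to an R²-type score; for the final ensemble it can be estimated with cross-validation (a sketch added here, the exact figure depends on the run):

# cross-validated R^2 of the voting ensemble
np.mean(cross_val_score(voting_reg, X, y, cv=10, scoring="r2"))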
