LWR in Machine Learning
Mathematical theorem and predicting credit card spending with Locally Weighted Regression (LWR)
We’ll explore Locally Weighted Regression in this chapter.
LWR
Locally Weighted Regression (LWR) is a non-parametric regression technique that fits a linear regression model to a dataset by weighting neighboring data points more heavily.
As LWR learns a local set of parameters for each prediction, LWR can capture non-linear relationships in the data.
LWR approximating a non-linear function.
Key Differences from Linear Regression:
LWR's advantage lies in approximating non-linear functions, since it fits to the local neighborhood of each query point.
However, it is not well suited to large datasets because of its computational cost, nor to high-dimensional inputs, since the local neighborhood of a query point may not contain enough data points that are truly close in every dimension.
Comparison of Linear Regression vs LWR
Mathematical Theorem
LWR has the same objective as linear regression: finding parameters θ that minimize the loss function (1).
Its core idea, however, is to assign each training point a weight w^(i) based on its proximity to the query point, using a Gaussian kernel controlled by a bandwidth parameter (τ).
The weight is defined as (2):
w^(i) = exp( −‖x^(i) − x‖² / (2τ²) )
where
- x^(i): the i-th training example,
- x: the given query point, and
- τ: the bandwidth parameter.
The bandwidth parameter (τ) controls how quickly the weight falls off as the training example x^(i) moves away from the query point x.
It significantly impacts the fit; small values can lead to overfitting, while large values can result in underfitting.
Now, assuming a fixed bandwidth parameter, if a query point x is close to a training point x^(i), the weight is close to 1 because the squared distance is close to 0. Conversely, if x is far from the training point, the weight approaches 0:
τ = 1.0 (left), τ = 5.0 (right)
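To make this concrete, here is a small standalone sketch (not from the article) that evaluates the Gaussian weight for a few distances and bandwidths:

import numpy as np

def gaussian_weight(x_i, x, tau):
    # Gaussian kernel weight between a training example x_i and a query point x
    return np.exp(-np.sum((x_i - x) ** 2) / (2 * tau ** 2))

x = np.array([0.0])  # query point
for tau in (1.0, 5.0):
    for dist in (0.1, 3.0, 10.0):
        x_i = np.array([dist])
        print(f"tau={tau}, |x_i - x|={dist}: w={gaussian_weight(x_i, x, tau):.4f}")

With τ = 1.0 the weight drops to about 0.01 at a distance of 3, while with τ = 5.0 the same point still carries a weight of about 0.84, which is why a larger bandwidth behaves more like a global linear fit.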
Cost function
Now, by the same analogy as linear regression, the weighted loss function can be denoted:
J(θ) = Σ_{i=1}^{m} w^(i) ( y^(i) − θᵀx^(i) )²
where W is the diagonal matrix notation of the weights w^(i):
W = diag( w^(1), w^(2), …, w^(m) ) ∈ ℝ^{m×m}
with n the number of input features (so the design matrix X is m × (n+1) including the bias column) and m the number of training examples.
Hence, applying the weighted normal equation (3), the locally weighted parameters are:
θ = ( XᵀWX )⁻¹ XᵀWy
This weight matrix W is what differentiates LWR from linear regression, where the parameter θ is given by the ordinary normal equation:
θ = ( XᵀX )⁻¹ Xᵀy
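As a minimal NumPy sketch (synthetic data, not the article's dataset), the two closed-form solutions can be compared side by side:

import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.uniform(-3, 3, 100)]           # design matrix with a bias column
y = np.sin(X[:, 1]) + rng.normal(scale=0.1, size=100)      # non-linear target

x_query = np.array([1.0, 0.5])                             # query point (bias term included)
tau = 0.5
w = np.exp(-(X[:, 1] - x_query[1]) ** 2 / (2 * tau ** 2))  # Gaussian weights w.r.t. the query
W = np.diag(w)

theta_lr = np.linalg.solve(X.T @ X, X.T @ y)               # ordinary normal equation
theta_lwr = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)      # weighted normal equation for this query
print(x_query @ theta_lr, x_query @ theta_lwr)             # global vs. local prediction at x = 0.5

The global fit uses one θ for every query, while LWR re-solves the weighted system for each query point, which is exactly where its extra prediction cost comes from.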
LWR is useful when a global linear model is insufficient for the data.
Key takeaways:
- Non-parametric: Retains training data for predictions.
- Local fitting: Creates a unique linear model for each prediction.
- Weighting: Nearby points have a greater influence.
- Bandwidth (τ): Controls the locality of the fit.
- Advantages: Handles non-linear data well with few features.
- Disadvantages: Computationally expensive, doesn’t work well with high-dimensional data, sensitive to outliers.
Using the same dataset as the previous article, we’ll predict transaction amounts and examine how the input features impact each test point differently.
1. Data Preprocessing
a) Choose input features
We’ll take the same approach as the previous article, streamlining the number of input features so that LWR can find neighboring data points without producing NaN values.
It is crucial to choose columns:
- with continuous values, and
- that appear to have a linear relationship with the target value (in this case, transaction amount):
import numpy as np
import pandas as pd

# load the dataframe (same as the previous article)
df = df[df['is_fraud'] == 0]
df = df[['amount', 'per_capita_income', 'yearly_income', 'credit_limit', 'credit_score']]
b) Filtering outliers
Next, we’ll filter outliers beyond 3 standard deviations from the mean (a threshold determined by project goals and dataset characteristics):
def filter_outliers(df: pd.DataFrame, column: str, std_threshold: float) -> pd.DataFrame:
    """Removes rows whose value in `column` lies more than std_threshold standard deviations from the mean."""
    mean = df[column].mean()
    std = df[column].std()
    upper_bound = mean + std_threshold * std
    lower_bound = mean - std_threshold * std
    # keep only the rows within [lower_bound, upper_bound]
    filtered_df = df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]
    return filtered_df
df = df.replace(to_replace='NaN', value=0)
df = filter_outliers(df=df, column='amount', std_threshold=3)
df = filter_outliers(df=df, column='per_capita_income', std_threshold=3)
df = filter_outliers(df=df, column='credit_limit', std_threshold=3)
df = filter_outliers(df=df, column='credit_score', std_threshold=3)
c) Taking logarithm
Finally, we’ll take the logarithm of the target value amount to mitigate its skewed distribution:
df['amount'] = df['amount'] + 1
df['amount_log'] = np.log(df['amount'])
df = df.drop(columns=['amount'])
df = df.dropna()
print('\nFinal data frame for LWR: \n', df.head(n=3))
Notice that we add 1 to amount to avoid generating negative infinity in the amount_log column (log(0) = −∞); np.log1p(df['amount']) would accomplish the same shift-and-log in one step.
Final DataFrame:
Final data frame for LWR:
per_capita_income yearly_income credit_limit credit_score amount_log
1 18076.0 36853.0 9100.0 834 2.745346
2 16894.0 34449.0 14802.0 686 4.394449
3 26168.0 53350.0 37634.0 685 5.303305
Defining the train/test datasets also follows the same process as the previous article.
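For completeness, here is a minimal sketch of what that split and scaling step might look like; it assumes a scikit-learn StandardScaler inside a ColumnTransformer (which would also supply the preprocessor.get_feature_names_out() call used below), and the previous article may differ in detail:

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

feature_cols = ['per_capita_income', 'yearly_income', 'credit_limit', 'credit_score']
X = df[feature_cols]
y = df['amount_log']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# scale the features so that distances in the Gaussian kernel are comparable across columns
preprocessor = ColumnTransformer(transformers=[('num', StandardScaler(), feature_cols)])
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)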
2. Defining LWR class
In the LWR class, we’ll leverage the instance variable local_weights to store, for each test point, the corresponding weights over the training examples.
class LocallyWeightedRegression:
    def __init__(self, tau: float):
        self.tau = tau                      # bandwidth parameter
        self.X_train = None
        self.y_train = None
        self.feature_names = None
        self.local_weights = dict()         # test point index -> weights over the training examples

    def _gaussian_kernel(self, x0, xi):
        # Gaussian kernel weight between a query point x0 and a training example xi
        return np.exp(-np.sum((x0 - xi)**2) / (2 * self.tau**2))

    def fit(self, X_train, y_train, feature_names: list = None):
        # LWR is non-parametric: fitting only stores the training data
        self.X_train = X_train
        self.y_train = y_train.flatten()
        self.feature_names = feature_names

    def predict(self, X_test: np.ndarray) -> np.ndarray:
        if self.X_train is None or self.y_train is None:
            raise ValueError("Model not trained. Call fit() first.")

        n_tests = X_test.shape[0]
        n_trains = self.X_train.shape[0]
        X_train_with_bias = np.c_[np.ones(n_trains), self.X_train]
        X_test_with_bias = np.c_[np.ones(n_tests), X_test]
        y_pred = np.zeros(n_tests)

        for i, x0 in enumerate(X_test):
            x0_with_bias = X_test_with_bias[i]
            # weight every training example by its proximity to the query point x0
            weights = np.array([self._gaussian_kernel(x0, xi) for xi in self.X_train])
            self.local_weights.update({i: weights})

            weighted_X_train = weights[:, np.newaxis] * X_train_with_bias
            weighted_y_train = weights * self.y_train
            try:
                # solve the weighted normal equation (X^T W X) theta = X^T W y
                theta = np.linalg.solve(X_train_with_bias.T @ weighted_X_train,
                                        X_train_with_bias.T @ weighted_y_train)
                y_pred[i] = np.dot(x0_with_bias, theta)  # theta^T * x0
            except np.linalg.LinAlgError:
                # singular system: fall back to a weighted average of the training targets
                scaled_weights = weights / np.sum(weights) if np.sum(weights) > 0 else np.zeros_like(weights)
                y_pred[i] = np.sum(scaled_weights * self.y_train)
        return y_pred

    def get_local_feature_importance(self, test_point: int, by: str = 'Weight', ascending: bool = True) -> pd.DataFrame:
        """
        Estimates local feature importance for a single test point, giving a rough indication of which
        training instances (and thus their feature values) most influenced the prediction for that point.
        """
        if self.X_train is None or self.feature_names is None:
            raise ValueError("Model not trained or feature names not provided.")

        local_weights_of_test_point = self.local_weights[test_point]
        importance = np.abs(self.X_train * local_weights_of_test_point[:, np.newaxis]).sum(axis=0)
        normalized_importance = importance / np.sum(importance) if np.sum(importance) > 0 else np.zeros_like(importance)
        df = pd.DataFrame({
            'column_id': list(range(len(self.feature_names))),
            'Feature': self.feature_names,
            'Weight': normalized_importance
        })
        df_sorted = df.sort_values(by=by, ascending=ascending)
        return df_sorted
model = LocallyWeightedRegression(tau=0.01)
feature_names = preprocessor.get_feature_names_out()
model.fit(X_train=X_train_processed, y_train=y_train.values, feature_names=feature_names)
y_pred = model.predict(X_test=X_test_processed)
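The predictions can then be scored with standard regression metrics; a short sketch, assuming the y_test split above and scikit-learn's metrics module:

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

rmse = np.sqrt(mean_squared_error(y_test, y_pred))   # error on the log-amount scale
r2 = r2_score(y_test, y_pred)
print(f"RMSE (log scale): {rmse:.3f}, R^2: {r2:.3f}")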
3. Results
We plot the local feature weights (via get_local_feature_importance) for the randomly chosen test points 50, 51, 110, 122, 123, and 126.
Each test point shows a completely different balance of weights across the input features:
Input weights by test point (m=5,000, n=4, tau=0.01)
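The exact plotting code is not shown in the article; one way to produce such a panel with matplotlib, using the get_local_feature_importance method above, might look like this:

import matplotlib.pyplot as plt

test_points = [50, 51, 110, 122, 123, 126]
fig, axes = plt.subplots(2, 3, figsize=(12, 6), sharex=True)
for ax, tp in zip(axes.ravel(), test_points):
    imp = model.get_local_feature_importance(test_point=tp, ascending=True)
    ax.barh(imp['Feature'], imp['Weight'])   # normalized local weight per input feature
    ax.set_title(f"Test point #{tp}")
fig.tight_layout()
plt.show()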
At some test points, the evaluation scores in the small boxes indicate mediocre performance.
To further improve the model, we can consider:
- Decreasing the bandwidth parameter (τ) (see the tuning sketch after this list);
- Increasing the number of training examples;
- Restructuring the input features (e.g., addressing potential interdependence between columns like credit_score and credit_limit).
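For the bandwidth in particular, a simple sweep over candidate values (a sketch, ideally run against a held-out validation split rather than the test set) can guide the choice of τ:

import numpy as np
from sklearn.metrics import mean_squared_error

for tau in (0.005, 0.01, 0.05, 0.1):
    candidate = LocallyWeightedRegression(tau=tau)
    candidate.fit(X_train=X_train_processed, y_train=y_train.values, feature_names=feature_names)
    preds = candidate.predict(X_test=X_test_processed)
    print(f"tau={tau}: RMSE={np.sqrt(mean_squared_error(y_test, preds)):.3f}")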
Let us take the test point #50 for reference. We can see the top three features (credit card limit, credit score, and yearly income) correlate with low card spending:
Test point #50 (m=5,000, n=4, tau=0.01)
On the other hand, credit score ranked top for the test point #110, yet it does not appear to increase predicted spending:
Test point #110 (m=5,000, n=4, tau=0.01)
- Time complexity: Training O(1), Prediction O(n²m)
- Space complexity: O(nm)
(m: training example size, n: input feature size, assuming m >>> n)
Conclusion
It is handy that LWR avoids the explicit specification of a fixed set of global parameters beforehand. However, LWR can become computationally expensive and, like standard linear regression, less effective when dealing with very large and high-dimensional datasets.
As demonstrated in the example, defining mutually independent input features and tuning the bandwidth parameter (τ) remain crucial for optimal prediction.
Author: Kuriko IWAI
April 4, 2025
Reference
1. Cf. linear regression, which minimizes the loss function J(θ) = Σ_{i=1}^{m} ( y^(i) − θᵀx^(i) )².
2. Gaussian kernel: w^(i) = exp( −‖x^(i) − x‖² / (2τ²) ).
3. Proof: writing the weighted loss as J(θ) = (y − Xθ)ᵀW(y − Xθ) and applying the partial derivative with respect to θ:
∂J/∂θ = −2XᵀW(y − Xθ) = 0
Hence, XᵀWXθ = XᵀWy, which gives θ = ( XᵀWX )⁻¹ XᵀWy.