LWR in Machine Learning
Mathematical theorem and predicting credit card spending with Locally Weighted Regression (LWR)
We’ll explore Locally Weighted Regression in this chapter.
LWR
Locally Weighted Regression (LWR) is a non-parametric regression technique that fits a linear regression model to a dataset by weighting neighboring data points more heavily.
As LWR learns a local set of parameters for each prediction, LWR can capture non-linear relationships in the data.
LWR approximating a non-linear function.
Key Differences from Linear Regression:
LWR's advantage lies in approximating non-linear functions, since it fits to the local neighborhood of each query point.
However, it is not well suited to large datasets because of its computational cost, nor to high-dimensional inputs, since the local neighborhood of a query point may not contain enough data points that are truly close in every dimension.
Comparison of Linear Regression vs LWR
Mathematical Theorem
LWR has the same objective as linear regression: finding parameters θ that minimize the loss function (1).
Its core idea, however, is to assign each training point a weight w^(i) based on its proximity to the query point, using a Gaussian kernel controlled by a bandwidth parameter (τ).
The weight is defined as (2):
w^(i) = exp( −‖x^(i) − x‖² / (2τ²) )
where
- x^(i): the i-th training example,
- x: the given query point, and
- τ: the bandwidth parameter.
The bandwidth parameter (τ) controls how quickly the weight falls off as the training example x^(i) moves away from the query point x.
It significantly impacts the fit; small values can lead to overfitting, while large values can result in underfitting.
Now, assuming a fixed bandwidth parameter, if a query point x is close to a training point x^(i), the weight is close to 1 because the squared distance is close to 0. Conversely, if x is far from the training point, the weight approaches 0:
τ = 1.0 (left), τ = 5.0 (right)
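To make this concrete, here is a small standalone sketch (not from the article) that evaluates the Gaussian weight for a few distances and bandwidths:

import numpy as np

def gaussian_weight(x_i, x, tau):
    # Gaussian kernel weight between a training example x_i and a query point x
    return np.exp(-np.sum((x_i - x) ** 2) / (2 * tau ** 2))

x = np.array([0.0])  # query point
for tau in (1.0, 5.0):
    for dist in (0.1, 3.0, 10.0):
        x_i = np.array([dist])
        print(f"tau={tau}, |x_i - x|={dist}: w={gaussian_weight(x_i, x, tau):.4f}")

With τ = 1.0 the weight drops to about 0.01 at a distance of 3, while with τ = 5.0 the same point still carries a weight of about 0.84, which is why a larger bandwidth behaves more like a global linear fit.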
Cost function
Now, by the same analogy as linear regression, the weighted loss function can be denoted:
J(θ) = Σ_{i=1}^{m} w^(i) ( y^(i) − θᵀx^(i) )²
where W is the diagonal matrix notation of the weights w^(i):
W = diag( w^(1), w^(2), …, w^(m) ) ∈ ℝ^{m×m}
with n the number of input features (so the design matrix X is m × (n+1) including the bias column) and m the number of training examples.
Hence, applying the weighted normal equation (3), the locally weighted parameters are:
θ = ( XᵀWX )⁻¹ XᵀWy
This weight matrix W is what differentiates LWR from linear regression, where the parameter θ is given by the ordinary normal equation:
θ = ( XᵀX )⁻¹ Xᵀy
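As a minimal NumPy sketch (synthetic data, not the article's dataset), the two closed-form solutions can be compared side by side:

import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.uniform(-3, 3, 100)]           # design matrix with a bias column
y = np.sin(X[:, 1]) + rng.normal(scale=0.1, size=100)      # non-linear target

x_query = np.array([1.0, 0.5])                             # query point (bias term included)
tau = 0.5
w = np.exp(-(X[:, 1] - x_query[1]) ** 2 / (2 * tau ** 2))  # Gaussian weights w.r.t. the query
W = np.diag(w)

theta_lr = np.linalg.solve(X.T @ X, X.T @ y)               # ordinary normal equation
theta_lwr = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)      # weighted normal equation for this query
print(x_query @ theta_lr, x_query @ theta_lwr)             # global vs. local prediction at x = 0.5

The global fit uses one θ for every query, while LWR re-solves the weighted system for each query point, which is exactly where its extra prediction cost comes from.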
LWR is useful when a global linear model is insufficient for the data.
Key takeaways:
- Non-parametric: Retains training data for predictions.
- Local fitting: Creates a unique linear model for each prediction.
- Weighting: Nearby points have a greater influence.
- Bandwidth (τ): Controls the locality of the fit.
- Advantages: Handles non-linear data well with few features.
- Disadvantages: Computationally expensive, doesn’t work well with high-dimensional data, sensitive to outliers.
Using the same dataset as the previous article, we’ll predict transaction amounts and examine how the input features impact each test point differently.
1. Data Preprocessing
a) Choose input features
We’ll take the same approach as the previous article, streamlining the number of input features so that LWR can find neighboring data points without producing NaN values.
It is crucial to choose columns:
- with continuous values, and
- that appear to have a linear relationship with the target value (in this case, transaction amount):
import numpy as np
import pandas as pd

# load the dataframe (same as the previous article)
df = df[df['is_fraud'] == 0]
df = df[['amount', 'per_capita_income', 'yearly_income', 'credit_limit', 'credit_score']]
b) Filtering outliers
Next, we’ll filter outliers beyond 3 standard deviations from the mean (a threshold determined by project goals and dataset characteristics):
def filter_outliers(df: pd.DataFrame, column: str, std_threshold: float) -> pd.DataFrame:
    """Removes rows whose value in `column` lies more than std_threshold standard deviations from the mean."""
    mean = df[column].mean()
    std = df[column].std()
    upper_bound = mean + std_threshold * std
    lower_bound = mean - std_threshold * std
    # keep only the rows within [lower_bound, upper_bound]
    filtered_df = df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]
    return filtered_df
df = df.replace(to_replace='NaN', value=0)
df = filter_outliers(df=df, column='amount', std_threshold=3)
df = filter_outliers(df=df, column='per_capita_income', std_threshold=3)
df = filter_outliers(df=df, column='credit_limit', std_threshold=3)
df = filter_outliers(df=df, column='credit_score', std_threshold=3)
c) Taking logarithm
Finally, we’ll take the logarithm of the target value amount to mitigate its skewed distribution:
df['amount'] = df['amount'] + 1
df['amount_log'] = np.log(df['amount'])
df = df.drop(columns=['amount'])
df = df.dropna()
print('\nFinal data frame for LWR: \n', df.head(n=3))
Notice that we add 1 to amount to avoid generating negative infinity in the amount_log column (log(0) = −∞); np.log1p(df['amount']) would accomplish the same shift-and-log in one step.
Final DataFrame:
Final data frame for LWR:
per_capita_income yearly_income credit_limit credit_score amount_log
1 18076.0 36853.0 9100.0 834 2.745346
2 16894.0 34449.0 14802.0 686 4.394449
3 26168.0 53350.0 37634.0 685 5.303305
Defining the train/test datasets also follows the same process as the previous article.
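For completeness, here is a minimal sketch of what that split and scaling step might look like; it assumes a scikit-learn StandardScaler inside a ColumnTransformer (which would also supply the preprocessor.get_feature_names_out() call used below), and the previous article may differ in detail:

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

feature_cols = ['per_capita_income', 'yearly_income', 'credit_limit', 'credit_score']
X = df[feature_cols]
y = df['amount_log']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# scale the features so that distances in the Gaussian kernel are comparable across columns
preprocessor = ColumnTransformer(transformers=[('num', StandardScaler(), feature_cols)])
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)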
2. Defining LWR class
In the LWR class, we’ll leverage the instance variable local_weights to store, for each test point, the corresponding weights over the training examples.
class LocallyWeightedRegression:
    def __init__(self, tau: float):
        self.tau = tau                      # bandwidth parameter
        self.X_train = None
        self.y_train = None
        self.feature_names = None
        self.local_weights = dict()         # test point index -> weights over the training examples

    def _gaussian_kernel(self, x0, xi):
        # Gaussian kernel weight between a query point x0 and a training example xi
        return np.exp(-np.sum((x0 - xi)**2) / (2 * self.tau**2))

    def fit(self, X_train, y_train, feature_names: list = None):
        # LWR is non-parametric: fitting only stores the training data
        self.X_train = X_train
        self.y_train = y_train.flatten()
        self.feature_names = feature_names

    def predict(self, X_test: np.ndarray) -> np.ndarray:
        if self.X_train is None or self.y_train is None:
            raise ValueError("Model not trained. Call fit() first.")

        n_tests = X_test.shape[0]
        n_trains = self.X_train.shape[0]
        X_train_with_bias = np.c_[np.ones(n_trains), self.X_train]
        X_test_with_bias = np.c_[np.ones(n_tests), X_test]
        y_pred = np.zeros(n_tests)

        for i, x0 in enumerate(X_test):
            x0_with_bias = X_test_with_bias[i]
            # weight every training example by its proximity to the query point x0
            weights = np.array([self._gaussian_kernel(x0, xi) for xi in self.X_train])
            self.local_weights.update({i: weights})

            weighted_X_train = weights[:, np.newaxis] * X_train_with_bias
            weighted_y_train = weights * self.y_train
            try:
                # solve the weighted normal equation (X^T W X) theta = X^T W y
                theta = np.linalg.solve(X_train_with_bias.T @ weighted_X_train,
                                        X_train_with_bias.T @ weighted_y_train)
                y_pred[i] = np.dot(x0_with_bias, theta)  # theta^T * x0
            except np.linalg.LinAlgError:
                # singular system: fall back to a weighted average of the training targets
                scaled_weights = weights / np.sum(weights) if np.sum(weights) > 0 else np.zeros_like(weights)
                y_pred[i] = np.sum(scaled_weights * self.y_train)
        return y_pred

    def get_local_feature_importance(self, test_point: int, by: str = 'Weight', ascending: bool = True) -> pd.DataFrame:
        """
        Estimates local feature importance for a single test point, giving a rough indication of which
        training instances (and thus their feature values) most influenced the prediction for that point.
        """
        if self.X_train is None or self.feature_names is None:
            raise ValueError("Model not trained or feature names not provided.")

        local_weights_of_test_point = self.local_weights[test_point]
        importance = np.abs(self.X_train * local_weights_of_test_point[:, np.newaxis]).sum(axis=0)
        normalized_importance = importance / np.sum(importance) if np.sum(importance) > 0 else np.zeros_like(importance)
        df = pd.DataFrame({
            'column_id': list(range(len(self.feature_names))),
            'Feature': self.feature_names,
            'Weight': normalized_importance
        })
        df_sorted = df.sort_values(by=by, ascending=ascending)
        return df_sorted
model = LocallyWeightedRegression(tau=0.01)
feature_names = preprocessor.get_feature_names_out()
model.fit(X_train=X_train_processed, y_train=y_train.values, feature_names=feature_names)
y_pred = model.predict(X_test=X_test_processed)
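The predictions can then be scored with standard regression metrics; a short sketch, assuming the y_test split above and scikit-learn's metrics module:

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

rmse = np.sqrt(mean_squared_error(y_test, y_pred))   # error on the log-amount scale
r2 = r2_score(y_test, y_pred)
print(f"RMSE (log scale): {rmse:.3f}, R^2: {r2:.3f}")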
3. Results
We plot the local feature weights (via get_local_feature_importance) for the randomly chosen test points 50, 51, 110, 122, 123, and 126.
Each test point shows a completely different balance of weights across the input features:
Input weights by test point (m=5,000, n=4, tau=0.01)
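The exact plotting code is not shown in the article; one way to produce such a panel with matplotlib, using the get_local_feature_importance method above, might look like this:

import matplotlib.pyplot as plt

test_points = [50, 51, 110, 122, 123, 126]
fig, axes = plt.subplots(2, 3, figsize=(12, 6), sharex=True)
for ax, tp in zip(axes.ravel(), test_points):
    imp = model.get_local_feature_importance(test_point=tp, ascending=True)
    ax.barh(imp['Feature'], imp['Weight'])   # normalized local weight per input feature
    ax.set_title(f"Test point #{tp}")
fig.tight_layout()
plt.show()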
At some test points, the evaluation scores in the small boxes indicate mediocre performance.
To further improve the model, we can consider:
- Decreasing the bandwidth parameter (τ) (see the tuning sketch after this list);
- Increasing the number of training examples;
- Restructuring the input features (e.g., addressing potential interdependence between columns like credit_score and credit_limit).
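For the bandwidth in particular, a simple sweep over candidate values (a sketch, ideally run against a held-out validation split rather than the test set) can guide the choice of τ:

import numpy as np
from sklearn.metrics import mean_squared_error

for tau in (0.005, 0.01, 0.05, 0.1):
    candidate = LocallyWeightedRegression(tau=tau)
    candidate.fit(X_train=X_train_processed, y_train=y_train.values, feature_names=feature_names)
    preds = candidate.predict(X_test=X_test_processed)
    print(f"tau={tau}: RMSE={np.sqrt(mean_squared_error(y_test, preds)):.3f}")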
Let us take the test point #50 for reference. We can see the top three features (credit card limit, credit score, and yearly income) correlate with low card spending:
Test point #50 (m=5,000, n=4, tau=0.01)
On the other hand, credit score ranked top for the test point #110, yet it does not appear to increase predicted spending:
Test point #110 (m=5,000, n=4, tau=0.01)
- Time complexity: Training O(1), Prediction O(n²m)
- Space complexity: O(nm)
(m: training example size, n: input feature size, assuming m >>> n)
Conclusion
It is handy that LWR avoids the explicit specification of a fixed set of global parameters beforehand. However, LWR can become computationally expensive and, like standard linear regression, less effective when dealing with very large and high-dimensional datasets.
As demonstrated in the example, defining mutually independent input features and tuning the bandwidth parameter (τ) remain crucial for optimal prediction.
Author: Kuriko IWAI
April 4, 2025
Reference
1. Cf. linear regression, which minimizes the loss function J(θ) = Σ_{i=1}^{m} ( y^(i) − θᵀx^(i) )².
2. Gaussian kernel: w^(i) = exp( −‖x^(i) − x‖² / (2τ²) ).
3. Proof: writing the weighted loss as J(θ) = (y − Xθ)ᵀW(y − Xθ) and applying the partial derivative with respect to θ:
∂J/∂θ = −2XᵀW(y − Xθ) = 0
Hence, XᵀWXθ = XᵀWy, which gives θ = ( XᵀWX )⁻¹ XᵀWy.