Machine Learning: Filling the Gaps (4)

Sample Duplication

[M] When should you remove duplicate training samples? When shouldn’t you?

  • When to remove duplicates: You should remove duplicate samples if:

    • The duplicates are artifacts of data collection or preprocessing errors (e.g., accidental repeated entries).
    • You have a large dataset and duplicates are not expected to provide new information, as they could lead to overfitting and biased models.
    • You want to avoid giving disproportionate importance to certain samples, which could skew model training.
  • When not to remove duplicates: Duplicates should not be removed if:

    • They represent genuine repeated events (e.g., customers making the same purchase repeatedly, medical data for a patient with recurring symptoms).
    • Duplicates are part of the natural data distribution and provide useful information about the frequency of certain patterns.

Summary: Duplicates should be removed when they’re artifacts but retained if they reflect true data patterns.
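
A minimal sketch of both cases with pandas (the columns are hypothetical): drop exact duplicates when they are collection artifacts, or make the repetition explicit when it is real signal.

```python
import pandas as pd

df = pd.DataFrame({
    "patient_id": [1, 1, 2, 3, 3],
    "symptom": ["rash", "rash", "cough", "fever", "fever"],
})

# Case 1: duplicates are collection artifacts -> drop exact duplicate rows.
deduped = df.drop_duplicates()

# Case 2: duplicates are genuine repeated events -> keep the frequency
# information explicitly rather than silently dropping rows.
counted = df.value_counts().reset_index(name="n_events")
print(deduped)
print(counted)
```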


[M] What happens if we accidentally duplicate every data point in your train set or in your test set?

  • Train set duplication: Duplicating every training point doubles the dataset without adding any new information. Because every sample is duplicated equally, relative pattern frequencies stay exactly the same; the practical effects are that each epoch now performs twice as many gradient updates (roughly equivalent to training for twice as many epochs, which can push the model toward overfitting if the training budget is not adjusted), and that training simply takes longer.

  • Test set duplication: Average metrics such as accuracy or F1-score are unchanged, since every (label, prediction) pair is simply counted twice. The subtler problem is that the test set appears twice as large as it really is, so confidence intervals or significance tests computed from it will look tighter than they should, overstating how reliable the evaluation is.

Summary: Duplicating the whole training set wastes compute and effectively doubles the number of training epochs, which can encourage overfitting; duplicating the whole test set leaves mean metrics unchanged but overstates the statistical reliability of the evaluation.
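
A quick sanity check of the test-set claim with scikit-learn: tiling every label/prediction pair leaves accuracy exactly unchanged, because the metric is an average.

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

# Duplicating every test point duplicates every (label, prediction) pair,
# so the average is unchanged -- both lines print 0.8.
print(accuracy_score(y_true, y_pred))
print(accuracy_score(np.tile(y_true, 2), np.tile(y_pred, 2)))
```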


Handling Missing Data

[H] In your dataset, two out of 20 variables have more than 30% missing values. What would you do?

Several strategies can be applied depending on the importance of the variables and the nature of the missing data:

  1. Remove the variables: If the two variables are not critical for the model's performance or are highly correlated with other variables, consider dropping them from the dataset.

  2. Imputation:

    • Simple imputation: Fill missing values with the mean or median for continuous variables, or the most frequent category (mode) for categorical variables.
    • Advanced imputation: Use machine learning techniques like k-nearest neighbors (KNN) imputation or a predictive model to impute missing values based on other variables.
  3. Model the missingness itself: If the fact that a value is missing carries information (e.g., a measurement that is only taken under certain conditions), add a binary indicator variable to flag missing entries, or build a separate model for the affected subset of the data.

Best Practice: Assess the importance of the variables and how much information might be lost if the variables are removed or imputed.
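
A sketch of the two imputation routes above with scikit-learn; the DataFrame and its ~35% missing rate are simulated to mirror the scenario, and add_indicator=True produces the missingness flag from point 3.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
X.loc[rng.random(100) < 0.35, "a"] = np.nan  # one heavily missing variable

# Simple imputation (median) plus a binary missingness-indicator column.
simple = SimpleImputer(strategy="median", add_indicator=True)
X_simple = simple.fit_transform(X)

# Advanced imputation: fill each gap from the 5 most similar complete rows.
knn = KNNImputer(n_neighbors=5)
X_knn = knn.fit_transform(X)
```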


[M] How might techniques that handle missing data make selection bias worse? How do you handle this bias?

Techniques like imputation can worsen selection bias if missing values are not randomly distributed (i.e., if the missingness is dependent on the underlying data). For example:

  • Mean imputation can artificially shrink the variance, leading to biased estimates.
  • Non-random missingness (e.g., patients with severe conditions have more missing values) can introduce bias in predictions if not handled properly.

To handle this bias:

  • Analyze the pattern of missingness: Determine if data is missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR), and choose the appropriate imputation method.
  • Use missing indicators: Add a binary indicator variable for missingness to allow the model to account for the presence or absence of values.
  • Check for bias: After handling missing data, evaluate your model for any potential bias or skew in predictions due to imputation.
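
One simple diagnostic sketch in pandas (hypothetical columns): compare another variable's distribution conditional on missingness. A sharp difference suggests the data is MAR or MNAR rather than MCAR, so naive imputation would bias the results.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "severity":  [1, 5, 4, 2, 5, 1],
    "lab_value": [0.2, np.nan, np.nan, 0.4, np.nan, 0.3],
})

df["lab_missing"] = df["lab_value"].isna().astype(int)

# If mean severity differs sharply between the two groups, the values are
# probably not missing completely at random.
print(df.groupby("lab_missing")["severity"].mean())
```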

Randomization in Experimental Design

[M] Why is randomization important when designing experiments (experimental design)?

Randomization is important because it:

  1. Reduces bias: It ensures that confounding variables are distributed randomly across treatment groups, preventing systematic differences.
  2. Enables causal inference: By randomly assigning participants to different groups, we can attribute differences in outcomes to the treatment or intervention rather than other factors.
  3. Promotes generalizability: Randomization makes it more likely that the sample is representative of the population, leading to more reliable and generalizable results.

Summary: Randomization controls for confounding variables and ensures valid causal inferences.
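
A minimal sketch of random assignment with NumPy: permute subject indices, then split, so that confounders end up unrelated to group membership on average.

```python
import numpy as np

rng = np.random.default_rng(42)
n_subjects = 100

# Shuffle all subject indices, then cut the permutation in half.
order = rng.permutation(n_subjects)
treatment = order[: n_subjects // 2]
control = order[n_subjects // 2:]
```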


Class Imbalance

[E] How would class imbalance affect your model?

Class imbalance can lead to a model that is biased toward the majority class, ignoring the minority class. The model may:

  • Produce a high overall accuracy by predicting the majority class most of the time, but perform poorly in identifying the minority class.
  • Generate more false negatives for the minority class and may not generalize well to unseen data.

Summary: Class imbalance can skew predictions, leading to poor performance on the minority class.


[E] Why is it hard for ML models to perform well on data with class imbalance?

It is hard because:

  1. Bias toward majority class: Most optimization algorithms, like those minimizing loss functions (e.g., cross-entropy), aim to reduce overall error. This can cause the model to predict the majority class most of the time and ignore the minority class.
  2. Lack of representative data: With few examples of the minority class, the model doesn’t have enough information to learn the patterns and characteristics of that class.
  3. Evaluation metrics: Metrics like accuracy can be misleading in imbalanced datasets, as they don’t account for the imbalance and may indicate good performance even when the model performs poorly on the minority class.

[M] Techniques to improve a model for detecting skin lesions when only 1% of the images show lesions:

  1. Resampling:

    • Oversample the minority class: Duplicate or generate synthetic samples (e.g., using SMOTE) for the minority class.
    • Undersample the majority class: Reduce the number of majority class samples to balance the dataset.
  2. Use weighted loss functions: Assign a higher penalty to misclassifications of the minority class by adjusting the loss function, so the model pays more attention to the minority class (see the sketch after this list).

  3. Anomaly detection methods: Given the rarity of lesions, treat the problem as an anomaly detection task, focusing on identifying outliers (i.e., lesion images) among the normal cases.

  4. Data augmentation: Augment the minority class data by applying transformations (e.g., rotations, zooms) to increase its representation without affecting the true distribution.
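
A sketch of the weighted-loss idea (point 2) with scikit-learn; the ~1%-positive data is simulated to mirror the scenario. With "balanced" weights, each lesion example is penalized roughly 99 times more heavily than a normal one.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.01).astype(int)    # ~1% positives, as in the scenario
X = rng.normal(size=(10_000, 5)) + y[:, None]  # positives shifted, for illustration

# Inverse-frequency weights: n_samples / (n_classes * class_count).
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))  # minority weight is ~99x the majority's

clf = LogisticRegression(class_weight="balanced").fit(X, y)
```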


Training Data Leakage

[M] If you oversample the rare class and then split your data into train and test splits, why does your model perform well on the test split but poorly in production?

By oversampling the rare class before splitting the data, duplicates of minority samples could end up in both the train and test sets. This causes data leakage: the model has already seen the test data during training, so it performs well on the test set. However, in production, when the model encounters new data, it may fail to generalize because it was effectively overfitting to the duplicated samples.

Solution: Always split your data into train and test sets before applying oversampling to prevent data leakage.
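
A sketch of the correct ordering with scikit-learn; a plain duplicate-the-minority resampler stands in for SMOTE, and the data is simulated.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (rng.random(1000) < 0.05).astype(int)

# 1) Split FIRST, so no oversampled copy can leak into the test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# 2) Oversample the rare class inside the training split only.
minority = y_tr == 1
X_up, y_up = resample(X_tr[minority], y_tr[minority],
                      n_samples=int((~minority).sum()), random_state=0)
X_bal = np.vstack([X_tr[~minority], X_up])
y_bal = np.concatenate([y_tr[~minority], y_up])
```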


[M] How could randomly splitting the data lead to data leakage in the spam classification task?

In the spam classification example, splitting the data randomly could lead to time-based leakage. Since the data spans 7 days, a random split could place comments from the same user or highly similar comments (e.g., responses to the same topic) in both the training and test sets. This would cause the model to see near-identical examples during training and testing, resulting in artificially high performance.

Solution: Use a time-based split, ensuring that the training data comes from an earlier time period than the test data to avoid leakage.
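
A sketch of a time-based split in pandas; the comment table, its created_at column, and the 5/2-day cutoff are all illustrative assumptions.

```python
import pandas as pd

comments = pd.DataFrame({
    "created_at": pd.date_range("2024-09-01", periods=7, freq="D"),
    "text": ["c1", "c2", "c3", "c4", "c5", "c6", "c7"],
})

# Train on the first 5 days, test on the last 2: test data is strictly
# later in time than anything the model saw during training.
cutoff = pd.Timestamp("2024-09-06")
train = comments[comments["created_at"] < cutoff]
test = comments[comments["created_at"] >= cutoff]
```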

[M] How does data sparsity affect your models?

Data sparsity refers to the situation where most feature values are zero; it is distinct from missing data, where values are unknown rather than zero. Sparse data often arises in high-dimensional settings such as text (e.g., bag-of-words vectors) or recommendation systems, where each data point contains only a small number of non-zero values. Sparsity can affect models in several ways:

  1. Harder to learn patterns: With sparse data, it can be difficult for models to learn meaningful patterns, as the relevant information is spread out and most features are zero.
  2. Increased computational cost: Processing large sparse matrices requires more memory and computational resources, even if most of the values are zero.
  3. Overfitting: Sparse data increases the chances of overfitting, especially for complex models, because there is less dense information for the model to learn from.
  4. Decreased generalization: Models trained on sparse data may not generalize well to new data because the training data doesn't provide enough dense information for the model to effectively learn patterns.

Handling sparsity: Techniques like feature engineering, dimensionality reduction, or using specialized algorithms (e.g., matrix factorization, embeddings) can help reduce the negative effects of data sparsity.
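
A sketch of the storage point with SciPy: a CSR matrix keeps only the non-zero entries, which is why sparse-aware algorithms can scale to high-dimensional text or recommendation data.

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.zeros((1000, 10_000))
dense[0, 5] = 1.0
dense[3, 42] = 2.0  # only 2 non-zeros among 10 million cells

sparse = csr_matrix(dense)
print(dense.nbytes)  # 80,000,000 bytes for the dense array
print(sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes)  # a few KB
```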

Loss Curves for Overfitting and Underfitting

[E] Draw the loss curves for overfitting and underfitting:

In an overfitting scenario, the model fits the training data very well but performs poorly on unseen data. In contrast, in underfitting, the model performs poorly on both the training and test data because it fails to capture the underlying patterns. Here’s what the curves typically look like:

  1. Overfitting:

    • Training loss decreases and stays low as training progresses.
    • Validation/test loss decreases at first but starts increasing again after some point, indicating overfitting.
  2. Underfitting:

    • Training loss remains high and does not decrease much.
    • Validation/test loss also remains high, showing that the model is not learning the data well enough.
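
A schematic matplotlib sketch of the two shapes; the curve formulas are made up purely to reproduce the qualitative behavior described above.

```python
import numpy as np
import matplotlib.pyplot as plt

epochs = np.arange(1, 101)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Overfitting: training loss keeps falling; validation loss turns back up.
ax1.plot(epochs, np.exp(-epochs / 20), label="train")
ax1.plot(epochs, 0.6 * np.exp(-epochs / 25) + 0.004 * epochs, label="validation")
ax1.set_title("Overfitting")

# Underfitting: both losses plateau at a high value.
ax2.plot(epochs, 0.9 - 0.1 * (1 - np.exp(-epochs / 10)), label="train")
ax2.plot(epochs, 0.95 - 0.1 * (1 - np.exp(-epochs / 10)), label="validation")
ax2.set_title("Underfitting")

for ax in (ax1, ax2):
    ax.set_xlabel("epoch")
    ax.set_ylabel("loss")
    ax.legend()
plt.tight_layout()
plt.show()
```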

Bias-Variance Trade-Off

[E] What’s the bias-variance trade-off?

The bias-variance trade-off describes the balance between two sources of error that affect model performance:

  • Bias: Error introduced by assuming too simple a model (underfitting). High bias means the model makes strong assumptions about the data and cannot capture its complexity.
  • Variance: Error introduced by the model being too sensitive to small fluctuations in the training data (overfitting). High variance means the model fits the training data well but does not generalize to new data.

The goal is to find a balance where both bias and variance are minimized for good generalization.


[M] How’s this tradeoff related to overfitting and underfitting?

  • Overfitting occurs when a model has low bias and high variance. It fits the training data very well but fails to generalize to unseen data.
  • Underfitting occurs when a model has high bias and low variance. The model is too simple to capture the patterns in the data and performs poorly on both training and test sets.

[M] Your model’s loss curves on the train, valid, and test sets look like this. What might have been the cause of this? What would you do?

If the training loss is decreasing, but the validation loss increases after a point, the model is likely overfitting. To address this:

  • Use regularization (an L2 penalty, dropout, or data augmentation).
  • Stop training earlier using early stopping to prevent the model from overfitting; both L2 and early stopping are sketched after this answer.
  • Get more data to improve generalization.

If the training and validation losses are both high, the model is likely underfitting. To fix this:

  • Increase the model complexity by using a deeper or more complex model.
  • Improve feature engineering or add more features.
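
A sketch combining two of the overfitting remedies above in recent scikit-learn: an L2 penalty plus built-in early stopping on a held-out validation fraction (the dataset is synthetic).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

clf = SGDClassifier(
    loss="log_loss",           # logistic-regression objective
    penalty="l2", alpha=1e-3,  # L2 regularization strength
    early_stopping=True,       # monitor a validation split during training
    validation_fraction=0.1,
    n_iter_no_change=5,        # stop after 5 epochs without improvement
    random_state=0,
)
clf.fit(X, y)
print(clf.n_iter_)  # epochs actually run before early stopping triggered
```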

