Machine Learning Review: Filling the Gaps (3)


[E] Why does an ML model’s performance degrade in production?

There are several reasons why a machine learning model's performance might degrade in production:

  1. Data drift: The distribution of the input data changes over time (e.g., customer behavior, market conditions), and the model no longer sees the same data it was trained on (a minimal drift check is sketched after this list).
  2. Concept drift: The underlying relationship between inputs and outputs changes (e.g., new trends, evolving patterns).
  3. Training-serving skew: There may be differences between the data used for training and the data seen in production (e.g., preprocessing discrepancies, feature engineering differences).
  4. Model staleness: The model may become outdated as new data becomes available, and the patterns it learned during training no longer apply.
  5. Infrastructure issues: Bugs, misconfigurations, or different hardware environments in production can introduce subtle errors that weren't present during testing.
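
To make the data-drift point concrete, below is a minimal sketch of a per-feature drift check using a two-sample Kolmogorov-Smirnov test. It assumes NumPy and SciPy are available; the arrays train_feature and prod_feature are synthetic stand-ins for a single numeric feature observed at training time and in production.

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic stand-ins: one numeric feature at training time vs. in production.
rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)
prod_feature = rng.normal(loc=0.3, scale=1.2, size=5_000)  # shifted and wider

# Two-sample KS test: a small p-value suggests the two distributions differ,
# i.e., possible data drift on this feature.
stat, p_value = ks_2samp(train_feature, prod_feature)
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3g}")
if p_value < 0.01:
    print("Warning: possible data drift on this feature.")
```

In practice the same check would be run per feature on a schedule, with the threshold tuned to tolerate an acceptable number of false alarms.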

Problems with Deploying Large Machine Learning Models

[M] What problems might we run into when deploying large machine learning models?

  1. Latency: Large models can be slow to make predictions, which may not meet the real-time requirements of certain applications (e.g., recommendation systems, autonomous driving).
  2. Resource usage: Large models require significant computational resources (e.g., memory, processing power) and may be difficult to deploy in resource-constrained environments (e.g., mobile devices, edge devices); see the quantization sketch after this list.
  3. Scalability: Deploying large models across many servers or users can lead to scalability issues, particularly if the model needs frequent retraining or updates.
  4. Inference costs: High computational requirements translate into higher inference costs in production, especially in cloud environments.
  5. Model interpretability: Large models, such as deep neural networks, are often less interpretable, making it harder to debug issues or explain predictions.
  6. Serving infrastructure complexity: Deploying large models may require specialized serving infrastructure, like GPUs or distributed systems, increasing operational complexity.
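
A common mitigation for the latency and resource-usage issues is post-training quantization. Below is a minimal sketch, assuming PyTorch is available; the small feed-forward model is a made-up stand-in for a much larger network.

```python
import torch
import torch.nn as nn

# Made-up model standing in for a much larger network.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))
model.eval()

# Post-training dynamic quantization: Linear weights are stored as int8,
# which typically shrinks the model and can speed up CPU inference.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
with torch.no_grad():
    print(quantized(x).shape)  # torch.Size([1, 10])
```

Other common options include pruning, knowledge distillation into a smaller student model, and compiling the model for the target hardware.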

Markov Chain Monte Carlo (MCMC)

Common MCMC algorithms:

  • Metropolis-Hastings: A proposal is made for a new state based on the current state, and it is either accepted or rejected based on a probability that depends on the ratio of the probabilities of the current and proposed states.
  • Gibbs sampling: A special case of MCMC that updates one variable at a time by sampling from its conditional distribution, given the other variables.

MCMC is widely used in Bayesian statistics and machine learning for sampling from posterior distributions.
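
As a concrete illustration, here is a minimal random-walk Metropolis-Hastings sampler in NumPy; the unnormalized target density and the step size are arbitrary choices made purely for illustration.

```python
import numpy as np

def target_unnormalized(x):
    # Unnormalized density: a mixture of two Gaussians (illustrative target).
    return np.exp(-0.5 * (x - 2) ** 2) + 0.5 * np.exp(-0.5 * (x + 2) ** 2)

def metropolis_hastings(n_samples, step=1.0, x0=0.0, seed=0):
    rng = np.random.default_rng(seed)
    samples = np.empty(n_samples)
    x = x0
    for i in range(n_samples):
        proposal = x + rng.normal(scale=step)  # symmetric random-walk proposal
        accept_prob = min(1.0, target_unnormalized(proposal) / target_unnormalized(x))
        if rng.random() < accept_prob:         # accept or reject the proposal
            x = proposal
        samples[i] = x
    return samples

draws = metropolis_hastings(10_000)
print(draws.mean(), draws.std())
```

In practice you would discard an initial burn-in period and monitor the acceptance rate (roughly 20-50% is a common target for random-walk proposals) before trusting the draws.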

Sampling from High-Dimensional Data

[M] If you need to sample from high-dimensional data, which sampling method would you choose?

When sampling from high-dimensional data, Markov Chain Monte Carlo (MCMC) methods, such as Hamiltonian Monte Carlo (HMC) or Gibbs sampling, are often preferred. These methods are well-suited for high-dimensional spaces because:

  1. MCMC algorithms can explore complex, multi-dimensional distributions efficiently, particularly when the distribution has high-dimensional dependencies.
  2. Hamiltonian Monte Carlo (HMC) uses gradient information to take larger, more informed steps in high-dimensional spaces, reducing the risk of getting stuck in regions of low probability.

In cases where you need independent samples (or low-variance estimates of expectations), importance sampling or rejection sampling can also be considered, although both become much less effective as dimensionality increases.
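
To illustrate Gibbs sampling specifically, here is a minimal sketch for a zero-mean, unit-variance bivariate Gaussian with correlation rho, where each full conditional is itself a Gaussian; the target is chosen purely because its conditionals have a closed form.

```python
import numpy as np

def gibbs_bivariate_gaussian(n_samples, rho=0.8, seed=0):
    """Gibbs sampler for a zero-mean, unit-variance bivariate Gaussian with correlation rho."""
    rng = np.random.default_rng(seed)
    x1, x2 = 0.0, 0.0
    samples = np.empty((n_samples, 2))
    cond_std = np.sqrt(1.0 - rho ** 2)  # std of each conditional distribution
    for i in range(n_samples):
        # Sample each coordinate from its conditional given the other.
        x1 = rng.normal(loc=rho * x2, scale=cond_std)
        x2 = rng.normal(loc=rho * x1, scale=cond_std)
        samples[i] = (x1, x2)
    return samples

draws = gibbs_bivariate_gaussian(20_000)
print(np.corrcoef(draws.T)[0, 1])  # should be close to 0.8
```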

Sampling 100K Comments to Label

[M] Suppose you have 10 million unlabeled comments from 10K users, collected over the last 24 months. How would you sample 100K comments to label?

When sampling comments for labeling, the goal is to ensure that the 100K samples are representative of the overall data distribution. Here are some strategies:

  1. Random Sampling: Select 100K comments at random from the pool of 10 million. This ensures every comment has an equal chance of being selected and provides a broad, unbiased sample.

  2. Stratified Sampling: If you suspect that certain user groups or time periods may have different comment behaviors (e.g., some users might post more abusive comments than others, or behavior changes over time), you can stratify the data by user or time and then randomly sample within each stratum to ensure that your sample is representative across these dimensions.

  3. Temporal Sampling: Since the comments span 24 months, consider stratifying the data by time to ensure that comments from all periods are equally represented. This ensures that the model can generalize across different time periods.

  4. User-Based Sampling: Since comments come from 10K users, you might want to ensure that the 100K comments come from a diverse set of users to avoid bias toward frequent users or specific groups.

Best Practice: A combination of stratified and random sampling might be best to ensure representativeness, especially across users and time periods.
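
Below is a minimal sketch of that combination in pandas: stratify by user and month, then sample the same fraction at random within each stratum. The column names (user_id, created_at) and the tiny synthetic table are hypothetical stand-ins for the real comment data.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the real comment table (column names are hypothetical).
rng = np.random.default_rng(0)
comments = pd.DataFrame({
    "user_id": rng.integers(0, 10, size=10_000),
    "created_at": pd.to_datetime("2023-01-01")
                  + pd.to_timedelta(rng.integers(0, 730, size=10_000), unit="D"),
    "text": ["..."] * 10_000,
})

def stratified_sample(df: pd.DataFrame, n_total: int, seed: int = 0) -> pd.DataFrame:
    """Proportionally sample the same fraction from every (user, month) stratum."""
    df = df.copy()
    df["month"] = df["created_at"].dt.to_period("M")
    frac = n_total / len(df)
    return df.groupby(["user_id", "month"], group_keys=False).sample(frac=frac, random_state=seed)

sample = stratified_sample(comments, n_total=1_000)
print(len(sample))  # roughly n_total rows, spread across all user/month strata
```

Proportional allocation keeps the sample representative of the original user/time mix; if some strata are very small or especially important, a minimum per-stratum count can be enforced instead.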


Estimating Label Quality from 100K Labeled Comments

[M] Suppose you get back 100K labeled comments from 20 annotators and you want to look at some labels to estimate the quality of the labels. How many labels would you look at? How would you sample them?

To estimate label quality, you should check a subset of the 100K labeled comments. Here’s how to sample and estimate the number:

  1. How many labels to check?

    • A common rule of thumb is to inspect around 1-5% of the labeled data. For 100K labeled comments, you could start by reviewing 1,000 to 5,000 labels.
    • You can also use statistical sampling to ensure a representative sample. For example, with a confidence level of 95% and a margin of error of 2-5%, a sample size of around 400 to 2,500 might be appropriate depending on the distribution of labels.
  2. How to sample them?

    • Random sampling: Select a random subset of labeled comments to get a broad view of the quality across the dataset.
    • Annotator-based sampling: Since you have 20 annotators, it's important to check for annotator bias. Randomly sample labels from each annotator to ensure that no single annotator is consistently inaccurate.
    • Stratified sampling: If the data has certain natural groupings (e.g., by user, topic, or time period), stratify the sampling to ensure that labels from different groups are represented.
    • Disagreement-based sampling: If some comments have been labeled by multiple annotators, focus on comments where there is disagreement between annotators, as these might indicate lower label quality or subjective labeling.

Best Practice: Start with a 1-5% sample of labels and use a mix of random and stratified sampling, with particular attention to annotator consistency.
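
The sample sizes quoted above come from the standard formula for estimating a proportion, n = z^2 * p(1 - p) / e^2, evaluated at the worst case p = 0.5. A quick check in Python:

```python
import math

def required_sample_size(confidence_z: float = 1.96, margin_of_error: float = 0.05, p: float = 0.5) -> int:
    """Sample size for estimating a proportion: n = z^2 * p * (1 - p) / e^2."""
    n = (confidence_z ** 2) * p * (1 - p) / (margin_of_error ** 2)
    return math.ceil(n)

print(required_sample_size(margin_of_error=0.05))  # -> 385  (95% confidence, 5% margin)
print(required_sample_size(margin_of_error=0.02))  # -> 2401 (95% confidence, 2% margin)
```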


Problem with Translation Argument

[M] Suppose you work for a news site that historically has translated only 1% of all its articles. Your coworker argues that you should translate more articles into Chinese because translations increase readership: on average, your translated articles have twice as many views as your non-translated articles. What might be wrong with this argument?

The argument may suffer from selection bias. If the site is translating only 1% of articles, it’s possible that the articles selected for translation are already expected to perform better (e.g., high-interest topics, popular writers, breaking news stories). Therefore, the higher view count may be due to the nature of the articles rather than the translation itself.

To address this issue:

  • You need to control for factors like the topic, author, and publication time of the articles.
  • A more robust test would be to randomly select articles for translation and compare the view counts of translated and non-translated articles of similar types.

Determining if Two Sets of Samples Come from the Same Distribution

[M] How to determine whether two sets of samples (e.g., train and test splits) come from the same distribution?

To determine if two sets of samples come from the same distribution, you can use statistical tests or visual methods:

  1. Kolmogorov-Smirnov (K-S) Test: A non-parametric test that compares the empirical cumulative distributions of two samples; useful for testing whether two samples come from the same continuous distribution (see the sketch after this list).

  2. Chi-Square Test: If the data is categorical, you can use the chi-square test to compare the observed frequencies of categories in the two samples.

  3. Mann-Whitney U Test: Another non-parametric test that compares whether the distributions of two independent samples are different.

  4. Jensen-Shannon Divergence: A measure of the similarity between two probability distributions. A small value indicates that the distributions are similar.

  5. Visual Methods: You can plot histograms, KDEs (Kernel Density Estimations), or cumulative distribution plots for both train and test sets to visually inspect whether they come from the same distribution.
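
A minimal sketch of checks 1, 2, and 4 with SciPy; the feature arrays and category counts below are synthetic placeholders for real train/test columns.

```python
import numpy as np
from scipy.stats import ks_2samp, chi2_contingency
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)
train_x = rng.normal(size=2_000)           # placeholder continuous feature (train split)
test_x = rng.normal(loc=0.1, size=2_000)   # placeholder continuous feature (test split)

# 1. Kolmogorov-Smirnov test on a continuous feature.
print(ks_2samp(train_x, test_x))

# 2. Chi-square test on a categorical feature: rows = splits, columns = category counts.
train_counts = np.array([500, 300, 200])
test_counts = np.array([480, 320, 210])
chi2, p, dof, _ = chi2_contingency(np.vstack([train_counts, test_counts]))
print(chi2, p)

# 4. Jensen-Shannon distance between two binned distributions
#    (SciPy returns the JS distance, the square root of the JS divergence).
bins = np.histogram_bin_edges(np.concatenate([train_x, test_x]), bins=30)
p_hist, _ = np.histogram(train_x, bins=bins, density=True)
q_hist, _ = np.histogram(test_x, bins=bins, density=True)
print(jensenshannon(p_hist, q_hist))
```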

[M] How to determine outliers in your data samples? What to do with them?

Methods for detecting outliers:

  1. Z-score or Standard Deviation Method: An outlier can be defined as a data point that is more than a certain number of standard deviations from the mean (e.g., $z > 3$).

  2. IQR (Interquartile Range): Outliers are data points that lie below $Q_1 - 1.5 \times \text{IQR}$ or above $Q_3 + 1.5 \times \text{IQR}$, where $Q_1$ and $Q_3$ are the first and third quartiles (this rule and Isolation Forest are both sketched after this list).

  3. Isolation Forest: An unsupervised learning algorithm that isolates outliers by recursively partitioning data. Points that require fewer splits to isolate are considered outliers.

  4. DBSCAN (Density-Based Clustering): A clustering algorithm that identifies dense regions in the data and treats points that don’t belong to any cluster as outliers.

  5. Visual Methods: Plotting data with scatter plots, box plots, or histograms can help visually identify outliers.
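
Here is a minimal sketch of the IQR rule and Isolation Forest on a synthetic 1-D feature; the contamination rate passed to IsolationForest is an assumed parameter, not something the data dictates.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(size=1_000), [8.0, -9.5, 12.0]])  # 3 planted outliers

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_outliers = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)
print("IQR outliers:", x[iqr_outliers])

# Isolation Forest: fit_predict returns -1 for predicted outliers.
iso = IsolationForest(contamination=0.01, random_state=0)  # assumed outlier fraction
labels = iso.fit_predict(x.reshape(-1, 1))
print("IsolationForest outliers:", x[labels == -1])
```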

What to do with outliers:

  • Remove outliers: If they are the result of data entry errors or noise, removing them can improve model performance.
  • Transform outliers: You can apply transformations (e.g., a log transformation) or clipping to reduce the impact of extreme values (see the short example after this list).
  • Use robust models: Some models, such as decision trees or median-based regression, are less sensitive to outliers.
  • Imputation: For missing or clearly incorrect values, consider imputing reasonable values based on the rest of the data.
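
For the transform/clip option, a short example on toy data: a log transform compresses large positive values, and clipping to the IQR fences (winsorizing) keeps the row but caps its influence.

```python
import numpy as np

x = np.array([1.0, 2.0, 2.5, 3.0, 100.0])  # toy data with one extreme value

# Log transform to compress the range of large positive values.
x_log = np.log1p(x)

# Clip (winsorize) to the IQR fences instead of dropping the point.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
x_clipped = np.clip(x, q1 - 1.5 * iqr, q3 + 1.5 * iqr)
print(x_log.round(2), x_clipped)
```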

