yall.querystrategies
-
class yall.querystrategies.UncertaintySampler(model_change=False)
Bases: yall.querystrategies.QueryStrategy
-
choose(scores)
Parameters: scores (numpy.ndarray) – Output of self.score()
Returns: Index of chosen example.
Return type: int
-
model_change_wrapper(score_func)
Model change wrapper around the scoring function. See the doc for __score() above for usage instructions.
\(score_{mc}(X) = score(X; t) - w_o score(X; t-1)\)
\(score(X; t)\): The score at time t
\(w_o = \frac{1}{\mid L \mid}\)
Parameters: score_func (function) – Scoring function to wrap.
Returns: Wrapped scoring function.
Return type: function
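The model change formula can be sketched as follows. This is a minimal illustration, not the library's wrapper: the helper name `model_change_scores` and its arguments are hypothetical, and it assumes the scores from rounds t and t-1 are aligned arrays over the same examples.

```python
import numpy as np

def model_change_scores(scores_t, scores_prev, n_labeled):
    """Hypothetical helper (not part of yall).

    Computes score_mc(X) = score(X; t) - w_o * score(X; t-1), with w_o = 1 / |L|.
    Assumes both score arrays refer to the same examples in the same order.
    """
    w_o = 1.0 / n_labeled
    scores_t = np.asarray(scores_t, dtype=float)
    scores_prev = np.asarray(scores_prev, dtype=float)
    return scores_t - w_o * scores_prev

scores_prev = np.array([0.2, 0.8, 0.5])  # scores at time t-1
scores_t = np.array([0.3, 0.6, 0.9])     # scores at time t
mc = model_change_scores(scores_t, scores_prev, n_labeled=10)
```

Because \(w_o\) shrinks as the labeled set grows, the correction from the previous round's scores fades over time in this sketch.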
-
-
class yall.querystrategies.CombinedSampler(qs1=None, qs2=None, beta=1, choice_metric=<function argmax>)
Bases: yall.querystrategies.QueryStrategy
Allows one sampler’s scores to be weighted by another’s according to the equation:
\(score(x) = score_{qs1}(x) \times score_{qs2}(x)^{\beta}\)
Assumes \(x^* = argmax(score)\)
Parameters: - qs1 (QueryStrategy) – Main query strategy.
- qs2 (QueryStrategy) – Query strategy to use as weight.
- beta (float) – Scale factor for score_qs2.
- choice_metric (function) – Function that takes a 1d np.array and returns a chosen index.
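A minimal sketch of the weighting equation, operating on precomputed score arrays. The function name `combined_scores` is hypothetical and stands in for whatever CombinedSampler does internally; only the arithmetic is taken from the equation above.

```python
import numpy as np

def combined_scores(scores_qs1, scores_qs2, beta=1.0):
    """Hypothetical helper: score(x) = score_qs1(x) * score_qs2(x) ** beta."""
    s1 = np.asarray(scores_qs1, dtype=float)
    s2 = np.asarray(scores_qs2, dtype=float)
    return s1 * s2 ** beta

s1 = np.array([0.9, 0.5, 0.7])  # e.g. uncertainty scores from qs1
s2 = np.array([0.1, 0.8, 0.6])  # e.g. weighting scores from qs2

weighted = combined_scores(s1, s2, beta=1.0)
chosen = int(np.argmax(weighted))  # choice_metric defaults to argmax

# With beta = 0 the weights are all 1, so the choice falls back to qs1 alone.
unweighted_choice = int(np.argmax(combined_scores(s1, s2, beta=0.0)))
```

Here the example with the highest qs1 score (index 0) loses to index 2 once the qs2 weights are applied, which is the intended effect of the weighting.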
-
class yall.querystrategies.DistDivSampler(qs1=None, qs2=None, lam=0.5, choice_metric=<function argmax>)
Bases: yall.querystrategies.QueryStrategy
Combined sampling method as in “Active learning for clinical text classification: is it better than random sampling?”
\(x^* = argmin_x (\lambda score_{qs1}(x) + (1 - \lambda) score_{qs2}(x))\)
Parameters: - qs1 (QueryStrategy) – Uncertainty sampling query strategy.
- qs2 (QueryStrategy) – Representative sampling query strategy.
- lam (float) – Query strategy weight \(\lambda\) in [0,1] or “dynamic”.
- choice_metric (function) – Function that takes a 1d np.array and returns a chosen index.
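The convex combination in the equation above can be sketched like this. The helper `dist_div_scores` is hypothetical, the “dynamic” setting is not covered, and the selection uses argmin to follow the formula (the constructor signature shows argmax as the default choice_metric, so the choice function may need to be passed explicitly).

```python
import numpy as np

def dist_div_scores(scores_qs1, scores_qs2, lam=0.5):
    """Hypothetical helper: lam * score_qs1(x) + (1 - lam) * score_qs2(x)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    s1 = np.asarray(scores_qs1, dtype=float)
    s2 = np.asarray(scores_qs2, dtype=float)
    return lam * s1 + (1.0 - lam) * s2

s1 = np.array([0.2, 0.6, 0.4])  # uncertainty-based scores
s2 = np.array([0.9, 0.1, 0.5])  # representativeness-based scores

mixed = dist_div_scores(s1, s2, lam=0.5)
chosen = int(np.argmin(mixed))  # follows the argmin in the formula
```

Setting lam to 1 or 0 reduces the sketch to qs1 or qs2 alone, which is a quick way to sanity-check either component.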
-
class yall.querystrategies.Random
Bases: yall.querystrategies.QueryStrategy
Random query strategy. Equivalent to passive learning.
-
class yall.querystrategies.SimpleMargin
Bases: yall.querystrategies.QueryStrategy
Finds the example x that is closest to the separating hyperplane.
\(x^* = argmin_x |f(x)|\)
-
choose(scores)
Returns the example with the shortest distance to the hyperplane. In the multiclass case, this will return the row index of the example with the smallest absolute distance to any hyperplane. Could be modified to choose the smallest average distance to all hyperplanes.
Parameters: scores (numpy.ndarray) – Output of self.score()
Returns: Index of chosen example.
Return type: int
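The selection rule described above might look like the following sketch. It assumes the scores are signed distances to the hyperplane(s), a 1d array in the binary case and an (n_examples, n_hyperplanes) array in the multiclass case; the function name is hypothetical and this is not the library's own code.

```python
import numpy as np

def choose_simple_margin(scores):
    """Hypothetical helper: index of the example closest to any hyperplane."""
    abs_scores = np.abs(np.asarray(scores, dtype=float))
    if abs_scores.ndim == 1:
        # Binary case: one signed distance per example.
        return int(np.argmin(abs_scores))
    # Multiclass case: smallest absolute distance per row, then the smallest row.
    return int(np.argmin(abs_scores.min(axis=1)))

binary = np.array([-1.2, 0.3, 2.0, -0.1])
multiclass = np.array([
    [1.0, -0.5, 2.0],
    [0.9, 1.5, -0.2],
    [-0.05, 3.0, 1.0],
])

binary_choice = choose_simple_margin(binary)
multiclass_choice = choose_simple_margin(multiclass)
```

Replacing `min(axis=1)` with `mean(axis=1)` would give the "smallest average distance" variant mentioned in the description.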
-
-
class yall.querystrategies.Margin
Bases: yall.querystrategies.QueryStrategy
Margin Sampler. Chooses the member from the unlabeled set with the smallest difference between the posterior probabilities of the two most probable class labels.
\(x^* = argmin_x \left( P(\hat{y_1}|x) - P(\hat{y_2}|x) \right)\)
- where \(\hat{y_1}\) is the most probable label
- and \(\hat{y_2}\) is the second most probable label.
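As an illustration, margin scores can be computed from a matrix of posterior probabilities as below. The function name `margin_scores` is hypothetical, and the input is assumed to be an (n_examples, n_classes) array such as the output of a classifier's predict_proba.

```python
import numpy as np

def margin_scores(proba):
    """Hypothetical helper: P(y_1 | x) - P(y_2 | x) for each row of proba."""
    sorted_proba = np.sort(np.asarray(proba, dtype=float), axis=1)
    # After an ascending sort, the last two columns hold the two largest values.
    return sorted_proba[:, -1] - sorted_proba[:, -2]

proba = np.array([
    [0.70, 0.20, 0.10],
    [0.40, 0.35, 0.25],
    [0.50, 0.10, 0.40],
])

margins = margin_scores(proba)
chosen = int(np.argmin(margins))  # smallest margin = most ambiguous example
```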
-
class yall.querystrategies.Entropy(model_change=False)
Bases: yall.querystrategies.UncertaintySampler
Entropy Sampler. Chooses the member from the unlabeled set with the greatest entropy across possible labels.
\(x^* = argmax_x -\sum_i P(y_i|x) \times log_2(P(y_i|x))\)
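A short sketch of the entropy score, assuming an (n_examples, n_classes) array of posterior probabilities. The function name and the `eps` clipping (used here to avoid log2(0)) are choices made for this example, not something the source specifies.

```python
import numpy as np

def entropy_scores(proba, eps=1e-12):
    """Hypothetical helper: -sum_i P(y_i | x) * log2 P(y_i | x) per row."""
    p = np.clip(np.asarray(proba, dtype=float), eps, 1.0)
    return -np.sum(p * np.log2(p), axis=1)

proba = np.array([
    [0.5, 0.5],  # maximally uncertain: 1 bit of entropy
    [0.9, 0.1],
    [1.0, 0.0],  # fully confident: (almost exactly) 0 bits
])

entropies = entropy_scores(proba)
chosen = int(np.argmax(entropies))
```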
-
class yall.querystrategies.LeastConfidence(model_change=False)
Bases: yall.querystrategies.UncertaintySampler
Least confidence (uncertainty sampling). Chooses the member from the unlabeled set with the greatest uncertainty, i.e. the one whose most likely label has the lowest posterior probability.
\(x^* = argmax_x 1 - P(\hat{y}|x)\)
where \(\hat{y} = argmax_y P(y|x)\)
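The least confidence score reduces to a one-liner over the posterior matrix. The sketch below uses a hypothetical function name and assumes an (n_examples, n_classes) probability array as input.

```python
import numpy as np

def least_confidence_scores(proba):
    """Hypothetical helper: 1 - P(y_hat | x), where y_hat is the top label."""
    return 1.0 - np.max(np.asarray(proba, dtype=float), axis=1)

proba = np.array([
    [0.90, 0.10],
    [0.55, 0.45],
    [0.20, 0.80],
])

lc = least_confidence_scores(proba)
chosen = int(np.argmax(lc))  # the least confident prediction
```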
-
class yall.querystrategies.LeastConfidenceBias(model_change=False)
Bases: yall.querystrategies.UncertaintySampler
Least confidence with bias. This is the same as least confidence, but moves the decision boundary according to the current class distribution.
\[x^* = \begin{cases} \frac{P(\hat{y}|x)}{P_{max}}, & \text{if } P(\hat{y}|x) < P_{max} \\ \frac{1 - P(\hat{y}|x)}{P_{max}}, & \text{otherwise} \end{cases}\]
where \(P_{max} = mean(0.5, 1 - pp)\) and \(pp\) is the percentage of positive examples in the labeled set.
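The piecewise score can be transcribed directly, as in the sketch below. It follows the formula exactly as written here (both branches divide by \(P_{max}\)), takes \(pp\) as a fraction in [0, 1], and accepts a 1d array of the probabilities \(P(\hat{y}|x)\); the function name and that input convention are assumptions made for the example.

```python
import numpy as np

def lc_bias_scores(p, pp):
    """Hypothetical helper: least confidence with bias, as written in the docs.

    p  -- 1d array of probabilities P(y_hat | x)
    pp -- fraction of positive examples in the labeled set, in [0, 1]
    """
    p = np.asarray(p, dtype=float)
    p_max = np.mean([0.5, 1.0 - pp])
    return np.where(p < p_max, p / p_max, (1.0 - p) / p_max)

p = np.array([0.3, 0.6, 0.9])
scores = lc_bias_scores(p, pp=0.2)  # P_max = mean(0.5, 0.8) = 0.65
chosen = int(np.argmax(scores))
```

With 20% positives the peak of the score moves from 0.5 to 0.65, so the example at 0.6 is preferred over the one at 0.3.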
-
class yall.querystrategies.LeastConfidenceDynamicBias(model_change=False)
Bases: yall.querystrategies.UncertaintySampler
Least confidence with dynamic bias. This is the same as least confidence with bias, but the bias also adjusts for the relative sizes of the labeled and unlabeled data sets.
\[x^* = \begin{cases} \frac{P(\hat{y}|x)}{P_{max}}, & \text{if } P(\hat{y}|x) < P_{max} \\ \frac{1 - P(\hat{y}|x)}{P_{max}}, & \text{otherwise} \end{cases}\]
where
\(P_{max} = (1 - pp)w_b + 0.5w_u\)
\(pp\) is the percentage of positive examples in the labeled set.
\(w_u = \frac{|L|}{|U_0|}\) and \(U_0\) is the initial unlabeled set.
\(w_b = 1 - w_u\)
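How \(P_{max}\) shifts as labels accumulate can be seen in a small sketch. It reads the weight on the 0.5 term as \(w_u\) (an assumption, since only \(w_u\) and \(w_b\) are defined), treats \(U_0\) by its size, and uses a hypothetical function name.

```python
def dynamic_p_max(pp, n_labeled, n_unlabeled_initial):
    """Hypothetical helper: P_max = (1 - pp) * w_b + 0.5 * w_u.

    w_u = |L| / |U_0| and w_b = 1 - w_u, with pp given as a fraction in [0, 1].
    """
    w_u = n_labeled / n_unlabeled_initial
    w_b = 1.0 - w_u
    return (1.0 - pp) * w_b + 0.5 * w_u

# With 20% positives, P_max moves from the biased value toward 0.5
# as the labeled set grows relative to the initial unlabeled pool.
p_start = dynamic_p_max(pp=0.2, n_labeled=0, n_unlabeled_initial=100)
p_half = dynamic_p_max(pp=0.2, n_labeled=50, n_unlabeled_initial=100)
p_end = dynamic_p_max(pp=0.2, n_labeled=100, n_unlabeled_initial=100)
```

Under these assumptions the class-distribution bias dominates early and fades to the unbiased threshold of 0.5 once \(|L|\) reaches \(|U_0|\).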
-
class yall.querystrategies.DistanceToCenter(metric='euclidean')
Bases: yall.querystrategies.QueryStrategy
Distance to Center sampling. Measures the distance of each point to the average x (center) in the labeled data set and computes the similarity using the equation below.
\(x^* = argmin_x \frac{1}{1 + dist(x, x_L)}\)
where dist(A, B) is the distance between vectors A and B.
\(x_L\) is the mean vector in L (i.e. L’s center).
Parameters: metric (str) – Distance metric to use. See spd.cdist doc for available metrics.
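A NumPy-only sketch of the similarity score for the Euclidean case follows. The function name is hypothetical, and the distance is computed with np.linalg.norm rather than the cdist call the metric parameter refers to, so other metrics are not covered.

```python
import numpy as np

def distance_to_center_scores(X_unlabeled, X_labeled):
    """Hypothetical helper: 1 / (1 + dist(x, x_L)) with Euclidean distance."""
    X_unlabeled = np.asarray(X_unlabeled, dtype=float)
    center = np.asarray(X_labeled, dtype=float).mean(axis=0)  # x_L
    dists = np.linalg.norm(X_unlabeled - center, axis=1)
    return 1.0 / (1.0 + dists)

X_labeled = np.array([[0.0, 0.0], [2.0, 0.0]])  # center is (1, 0)
X_unlabeled = np.array([[1.0, 0.0], [1.0, 3.0], [5.0, 0.0]])

similarities = distance_to_center_scores(X_unlabeled, X_labeled)
chosen = int(np.argmin(similarities))  # least similar to the labeled center
```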
-
class yall.querystrategies.MinMax(metric='euclidean')
Bases: yall.querystrategies.QueryStrategy
Finds the example x in U whose smallest distance to any point in L is the largest. Ensures representative coverage of the dataset.
\(x^* = argmax_{x_i} ( min_{x_j} dist(x_i, x_j) )\)
where \(x_i \in U\), \(x_j \in L\), dist(.) is the given distance metric.
Parameters: metric (str) – Distance metric to use. See the spd.cdist doc for available metrics.
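The max-min rule can be sketched with broadcasting, again for the Euclidean case only. The function name is hypothetical, and the pairwise distances are built by hand here instead of through cdist.

```python
import numpy as np

def min_max_scores(X_unlabeled, X_labeled):
    """Hypothetical helper: for each x_i in U, min over x_j in L of dist(x_i, x_j)."""
    U = np.asarray(X_unlabeled, dtype=float)
    L = np.asarray(X_labeled, dtype=float)
    # Pairwise differences have shape (|U|, |L|, n_features).
    diffs = U[:, None, :] - L[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    return dists.min(axis=1)

X_labeled = np.array([[0.0, 0.0], [4.0, 0.0]])
X_unlabeled = np.array([[1.0, 0.0], [2.0, 0.0], [4.0, 1.0]])

nearest = min_max_scores(X_unlabeled, X_labeled)
chosen = int(np.argmax(nearest))  # farthest from its nearest labeled point
```

The midpoint between the two labeled points wins because it is the unlabeled example least covered by the current labeled set.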
-
class yall.querystrategies.Density(metric='euclidean')
Bases: yall.querystrategies.QueryStrategy
Finds the example x in U that has the greatest average distance to every other point in U.
\(x^* = argmin_x \frac{1}{|U|} \sum_{u=1}^{|U|} \frac{1}{1 + dist(x, x_u)}\)
Parameters: metric (str) – Distance metric to use. See spd.cdist doc for available metrics.
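A sketch of the average-similarity score for the Euclidean case is given below. The function name is hypothetical, and the sum is assumed to run over all of U including x itself (which contributes a similarity of 1); the source does not say whether the self term is excluded.

```python
import numpy as np

def density_scores(X_unlabeled):
    """Hypothetical helper: mean over u of 1 / (1 + dist(x, x_u)) for each x in U."""
    U = np.asarray(X_unlabeled, dtype=float)
    diffs = U[:, None, :] - U[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)  # shape (|U|, |U|)
    return (1.0 / (1.0 + dists)).mean(axis=1)

X_unlabeled = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])

avg_similarity = density_scores(X_unlabeled)
chosen = int(np.argmin(avg_similarity))  # lowest average similarity to U
```

In this toy set the isolated point at (10, 0) has the lowest average similarity, matching the description of picking the example farthest on average from the rest of U.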