yall.querystrategies¶

class yall.querystrategies.UncertaintySampler(model_change=False)[source]¶

Bases: yall.querystrategies.QueryStrategy

choose(scores)[source]¶

Parameters:	scores (numpy.ndarray) – Output of self.score()
Returns:	Index of chosen example.
Return type:	int

model_change_wrapper(score_func)[source]¶

Model change wrapper around the scoring function. See the documentation for __score() above for usage instructions.

\(score_{mc}(X) = score(X; t) - w_o score(X; t-1)\)

\(score(X; t)\): The score at time t

\(w_o = \frac{1}{\mid L \mid}\)

Parameters:	score_func (function) – Scoring function to wrap.
Returns:	Wrapped scoring function.
Return type:	function
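The model change formula above can be sketched as a plain function on two score arrays. This is only an illustration, not yall's implementation: the name model_change_scores and its arguments are hypothetical, and it assumes the same examples are scored at steps t and t-1 and that \(\mid L \mid\) is passed in as n_labeled.

```python
import numpy as np

def model_change_scores(scores_t, scores_prev, n_labeled):
    """score_mc(X) = score(X; t) - w_o * score(X; t-1), with w_o = 1 / |L|."""
    w_o = 1.0 / n_labeled  # weight on the previous step's scores
    current = np.asarray(scores_t, dtype=float)
    previous = np.asarray(scores_prev, dtype=float)
    return current - w_o * previous

# Hypothetical scores for the same two examples at steps t and t-1, with |L| = 4.
adjusted = model_change_scores([0.8, 0.4], [0.6, 0.2], n_labeled=4)
```

Because \(w_o\) shrinks as the labeled set grows, the correction from the previous step fades over time in this sketch.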

class yall.querystrategies.CombinedSampler(qs1=None, qs2=None, beta=1, choice_metric=<function argmax>)[source]¶

Bases: yall.querystrategies.QueryStrategy

Allows one sampler’s scores to be weighted by another’s according to the equation:

\(score(x) = score_{qs1}(x) \times score_{qs2}(x)^{\beta}\)

Assumes \(x^* = argmax(score)\)

Parameters:	qs1 (QueryStrategy) – Main query strategy.
	qs2 (QueryStrategy) – Query strategy to use as weight.
	beta (float) – Scale factor for score_qs2.
	choice_metric (function) – Function that takes a 1d np.array and returns a chosen index.

choose(scores)[source]¶

Returns the example with the “best” score according to self.choice_metric.

score(*args)[source]¶

Computes the combined scores from qs1 and qs2.

Returns:	scores
Return type:	numpy.ndarray
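As a rough illustration of the CombinedSampler equation, the sketch below combines two precomputed score arrays with NumPy. The helper combined_scores and the sample values are hypothetical; it does not call yall itself and assumes both strategies score the same examples in the same order.

```python
import numpy as np

def combined_scores(scores_qs1, scores_qs2, beta=1.0):
    """score(x) = score_qs1(x) * score_qs2(x) ** beta."""
    s1 = np.asarray(scores_qs1, dtype=float)
    s2 = np.asarray(scores_qs2, dtype=float)
    return s1 * s2 ** beta

# Hypothetical scores for three unlabeled examples.
scores = combined_scores([0.2, 0.9, 0.5], [1.0, 0.1, 0.8], beta=2)
chosen = int(np.argmax(scores))  # mirrors the default choice_metric, argmax
```

A larger beta lets the weighting strategy dominate: here the second example has the highest qs1 score but is not chosen because its qs2 score is low.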

class yall.querystrategies.DistDivSampler(qs1=None, qs2=None, lam=0.5, choice_metric=<function argmax>)[source]¶

Bases: yall.querystrategies.QueryStrategy

Combined sampling method as in “Active learning for clinical text classification: is it better than random sampling?”

\(x^* = argmin_x (\lambda score_{qs1}(x) + (1 - \lambda) score_{qs2}(x))\)

Parameters:	qs1 (QueryStrategy) – Uncertainty sampling query strategy.
	qs2 (QueryStrategy) – Representative sampling query strategy.
	lam (float) – Query strategy weight \(\lambda\), a value in [0, 1] or “dynamic”.
	choice_metric (function) – Function that takes a 1d np.array and returns a chosen index.

choose(scores)[source]¶

Returns the example with the “best” score according to self.choice_metric.

score(*args)[source]¶

Computes the combined scores from qs1 and qs2.

Returns:	scores
Return type:	numpy.ndarray
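The weighted sum used by DistDivSampler can be sketched as follows. This is an assumption-laden illustration: dist_div_scores is a hypothetical helper, only a fixed numeric weight is handled (the “dynamic” option is not covered), and the choice follows the argmin in the formula rather than any particular choice_metric.

```python
import numpy as np

def dist_div_scores(scores_qs1, scores_qs2, lam=0.5):
    """lam * score_qs1(x) + (1 - lam) * score_qs2(x)."""
    s1 = np.asarray(scores_qs1, dtype=float)
    s2 = np.asarray(scores_qs2, dtype=float)
    return lam * s1 + (1.0 - lam) * s2

# Hypothetical uncertainty (qs1) and representativeness (qs2) scores.
scores = dist_div_scores([0.1, 0.7, 0.4], [0.9, 0.2, 0.3], lam=0.5)
chosen = int(np.argmin(scores))  # the formula selects the minimum
```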

class yall.querystrategies.Random[source]¶

Bases: yall.querystrategies.QueryStrategy

Random query strategy. Equivalent to passive learning.

choose(scores)[source]¶

Picks an index at random.

Parameters:	scores (numpy.ndarray) – Output of self.score()
Returns:	Index of chosen example.
Return type:	int

score(*args)[source]¶

In the random case, just output the indices.

class yall.querystrategies.SimpleMargin[source]¶

Bases: yall.querystrategies.QueryStrategy

Finds the example x that is closest to the separating hyperplane.

\(x^* = argmin_x |f(x)|\)

choose(scores)[source]¶

Returns the example with the shortest distance to the hyperplane. In the multiclass case, this will return the row index of the example with the smallest absolute distance to any hyperplane. Could be modified to choose the smallest average distance to all hyperplanes.

Parameters:	scores (numpy.ndarray) – Output of self.score()
Returns:	Index of chosen example.
Return type:	int

score(*args)[source]¶

Computes distances to the hyperplane for each member of the unlabeled set.
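The selection rule described for SimpleMargin, including the multiclass case, might look like the sketch below. It works on an array of signed distances such as a linear classifier's decision values; the function simple_margin_choice and the sample arrays are hypothetical and are not taken from yall.

```python
import numpy as np

def simple_margin_choice(decision_values):
    """Index of the example closest to a separating hyperplane."""
    dist = np.abs(np.asarray(decision_values, dtype=float))
    if dist.ndim == 2:
        # Multiclass: one column per hyperplane, keep the smallest
        # absolute distance in each row.
        dist = dist.min(axis=1)
    return int(np.argmin(dist))

# Binary case: one signed distance per unlabeled example.
binary_choice = simple_margin_choice([-1.2, 0.3, 0.8, -0.05])

# Multiclass case: rows are examples, columns are hyperplanes.
multi_choice = simple_margin_choice([[1.0, -0.5, 2.0],
                                     [0.2, 0.9, -0.1],
                                     [-0.7, 0.6, 0.3]])
```

Switching `dist.min(axis=1)` to `dist.mean(axis=1)` would give the average-distance variant mentioned in the description.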

class yall.querystrategies.Margin[source]¶

Bases: yall.querystrategies.QueryStrategy

Margin Sampler. Chooses the member from the unlabeled set with the smallest difference between the posterior probabilities of the two most probable class labels.

\(x^* = argmin_x (P(\hat{y_1}|x) - P(\hat{y_2}|x))\)

where \(\hat{y_1}\) is the most probable label

and \(\hat{y_2}\) is the second most probable label.

choose(scores)[source]¶

Returns the example with the smallest difference between the two most probable class labels.

Parameters:	scores (numpy.ndarray) – Output of self.score()
Returns:	Index of chosen example.
Return type:	int

score(*args)[source]¶

Computes the difference between posterior probability estimates for the top two most probable labels.

Returns:	Posterior probability differences.
Return type:	numpy.ndarray
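A minimal sketch of the margin computation, assuming the posterior estimates are available as an (n_examples, n_classes) array such as the output of a classifier's predict_proba. The helper margin_scores is hypothetical and does not reproduce yall's code.

```python
import numpy as np

def margin_scores(probs):
    """P(y1_hat | x) - P(y2_hat | x) for each row of class probabilities."""
    ordered = np.sort(np.asarray(probs, dtype=float), axis=1)  # ascending per row
    return ordered[:, -1] - ordered[:, -2]

# Hypothetical posteriors for three unlabeled examples over three classes.
probs = [[0.5, 0.3, 0.2],
         [0.4, 0.35, 0.25],
         [0.9, 0.05, 0.05]]
scores = margin_scores(probs)
chosen = int(np.argmin(scores))  # smallest margin is the most ambiguous example
```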

class yall.querystrategies.Entropy(model_change=False)[source]¶

Bases: yall.querystrategies.UncertaintySampler

Entropy Sampler. Chooses the member from the unlabeled set with the greatest entropy across possible labels.

\(x^* = argmax_x -\sum_i P(y_i|x) \times log_2(P(y_i|x))\)
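The entropy criterion can be illustrated with a few lines of NumPy. This is a sketch under the assumption that class posteriors are given as an (n_examples, n_classes) array; entropy_scores is a hypothetical name, and zero probabilities are treated as contributing nothing to the sum.

```python
import numpy as np

def entropy_scores(probs):
    """Shannon entropy (base 2) of each row of class probabilities."""
    p = np.asarray(probs, dtype=float)
    # Replace zeros by 1 inside the log so that 0 * log2(0) counts as 0.
    safe = np.where(p > 0, p, 1.0)
    return -np.sum(p * np.log2(safe), axis=1)

# Hypothetical posteriors for three unlabeled examples over two classes.
scores = entropy_scores([[0.5, 0.5], [0.9, 0.1], [1.0, 0.0]])
chosen = int(np.argmax(scores))  # the most uncertain example
```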

class yall.querystrategies.LeastConfidence(model_change=False)[source]¶

Bases: yall.querystrategies.UncertaintySampler

Least confidence (uncertainty sampling). Chooses the member from the unlabeled set with the greatest uncertainty, i.e. the greatest total posterior probability assigned to labels other than the most likely one.

\(x^* = argmax_x (1 - P(\hat{y}|x))\)

where \(\hat{y} = argmax_y P(y|x)\)
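For illustration, the least confidence score reduces to one minus the largest class probability in each row. The helper below is a hypothetical sketch on a plain probability array, not yall's own scoring code.

```python
import numpy as np

def least_confidence_scores(probs):
    """1 - P(y_hat | x), where y_hat is the most probable label."""
    p = np.asarray(probs, dtype=float)
    return 1.0 - p.max(axis=1)

# Hypothetical posteriors for three unlabeled examples over three classes.
scores = least_confidence_scores([[0.5, 0.3, 0.2],
                                  [0.4, 0.35, 0.25],
                                  [0.9, 0.05, 0.05]])
chosen = int(np.argmax(scores))
```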

class yall.querystrategies.LeastConfidenceBias(model_change=False)[source]¶

Bases: yall.querystrategies.UncertaintySampler

Least confidence with bias. This is the same as least confidence, but moves the decision boundary according to the current class distribution.

\[x^* = \Biggl \lbrace { \frac{P(\hat{y}|x)}{P_{max}}, \text{ if } {P(\hat{y}|x) < P_{max}} \atop \frac{1 - P(\hat{y}|x)}{P_{max}}, \text{ otherwise } }\]

where

\(P_{max} = mean(0.5, 1 - pp)\) and \(pp\) is the percentage of positive examples in the labeled set.
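The piecewise score above can be sketched as follows. Several things here are assumptions rather than facts about yall: the function name, that p holds the probability \(P(\hat{y}|x)\) exactly as it appears in the formula, that \(pp\) is given as a fraction between 0 and 1, and that the example with the highest score is the one selected.

```python
import numpy as np

def biased_least_confidence(p, pos_fraction):
    """Piecewise score with P_max = mean(0.5, 1 - pp)."""
    p = np.asarray(p, dtype=float)
    p_max = np.mean([0.5, 1.0 - pos_fraction])
    # Below P_max the score rises with p; at or above it the score falls.
    return np.where(p < p_max, p / p_max, (1.0 - p) / p_max)

# Hypothetical probabilities, with 20% positive examples in the labeled set.
scores = biased_least_confidence([0.1, 0.6, 0.9], pos_fraction=0.2)
chosen = int(np.argmax(scores))
```

With 20% positives, \(P_{max}\) is 0.65, so the example at 0.6 scores highest: the point of greatest uncertainty has moved away from 0.5 toward the majority class.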

class yall.querystrategies.LeastConfidenceDynamicBias(model_change=False)[source]¶

Bases: yall.querystrategies.UncertaintySampler

Least confidence with dynamic bias. This is the same as least confidence with bias, but the bias also adjusts for the relative sizes of the labeled and unlabeled data sets.

\[x^* = \Biggl \lbrace { \frac{P(\hat{y}|x)}{P_{max}}, \text{ if } {P(\hat{y}|x) < P_{max}} \atop \frac{1 - P(\hat{y}|x)}{P_{max}}, \text{ otherwise } }\]

where

\(P_{max} = (1 - pp)w_b + 0.5w_u\)

\(pp\) is the percentage of positive examples in the labeled set.

\(w_u = \frac{|L|}{|U_0|}\) and \(U_0\) is the initial unlabeled set.

\(w_b = 1 - w_u\)
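A small worked example of the dynamic \(P_{max}\), written as a hypothetical helper. It assumes the weight on 0.5 is \(w_u\), that \(w_u\) is the ratio of the labeled set size to the size of the initial unlabeled set, and that \(pp\) is a fraction between 0 and 1.

```python
def dynamic_p_max(pos_fraction, n_labeled, n_unlabeled_initial):
    """P_max = (1 - pp) * w_b + 0.5 * w_u, with w_b = 1 - w_u."""
    w_u = n_labeled / n_unlabeled_initial
    w_b = 1.0 - w_u
    return (1.0 - pos_fraction) * w_b + 0.5 * w_u

# Early in learning the class-distribution term dominates ...
early = dynamic_p_max(pos_fraction=0.2, n_labeled=10, n_unlabeled_initial=100)
# ... and once |L| reaches |U_0| the boundary is back at 0.5.
late = dynamic_p_max(pos_fraction=0.2, n_labeled=100, n_unlabeled_initial=100)
```

Under these assumptions the bias fades as labeling proceeds, and the strategy approaches plain least confidence.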

class yall.querystrategies.DistanceToCenter(metric='euclidean')[source]¶

Bases: yall.querystrategies.QueryStrategy

Distance to Center sampling. Measures the distance of each point to the average x (center) in the labeled data set and computes the similarity using the equation below.

\(x^* = argmin_x \frac{1}{1 + dist(x, x_L)}\)

where dist(A, B) is the distance between vectors A and B.

\(x_L\) is the mean vector in L (i.e. L’s center).

Parameters:	metric (str) – Distance metric to use. See spd.cdist doc for available metrics.

choose(scores)[source]¶

Returns the example with the lowest similarity to the average x in L.

Parameters:	scores (numpy.ndarray) – Output of self.score()
Returns:	Index of chosen example.
Return type:	int

score(*args)[source]¶

Returns:	Distances.
Return type:	numpy.ndarray
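The similarity in the DistanceToCenter equation can be sketched with NumPy alone. This illustration is restricted to the Euclidean metric (a cdist-style function would be needed for other metrics), and the helper name and sample points are hypothetical.

```python
import numpy as np

def distance_to_center_scores(unlabeled_x, labeled_x):
    """Similarity 1 / (1 + dist(x, x_L)) of each unlabeled point to L's center."""
    U = np.asarray(unlabeled_x, dtype=float)
    L = np.asarray(labeled_x, dtype=float)
    center = L.mean(axis=0)                    # x_L, the mean vector of L
    dist = np.linalg.norm(U - center, axis=1)  # Euclidean distance to the center
    return 1.0 / (1.0 + dist)

# Hypothetical 2-d data: the center of L is (1, 0).
labeled = [[0.0, 0.0], [2.0, 0.0]]
unlabeled = [[1.0, 0.0], [1.0, 3.0], [4.0, 4.0]]
scores = distance_to_center_scores(unlabeled, labeled)
chosen = int(np.argmin(scores))  # lowest similarity, i.e. farthest from the center
```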

class yall.querystrategies.MinMax(metric='euclidean')[source]¶

Bases: yall.querystrategies.QueryStrategy

Finds the example x in U whose minimum distance to the points in L is greatest. Ensures representative coverage of the dataset.

\(x^* = argmax_{x_i} ( min_{x_j} dist(x_i, x_j) )\)

where \(x_i \in U\), \(x_j \in L\), dist(.) is the given distance metric.

Parameters:	metric (str) – Distance metric to use. See the spd.cdist doc for available metrics.

choose(scores)[source]¶

Returns the example with the greatest minimum distance to the points in L.

Parameters:	scores (numpy.ndarray) – Output of self.score()
Returns:	Index of chosen example.
Return type:	int

score(*args)[source]¶

Computes the minimum distance between each member of unlabeled_x and the members of labeled_x.

Returns:	Minimum distances from each unlabeled_x to each labeled_x.
Return type:	numpy.ndarray
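The max-min rule can be illustrated with a pairwise distance matrix. The sketch below uses NumPy broadcasting and the Euclidean metric only; minmax_scores is a hypothetical helper and not the library's implementation.

```python
import numpy as np

def minmax_scores(unlabeled_x, labeled_x):
    """Minimum Euclidean distance from each unlabeled point to any labeled point."""
    U = np.asarray(unlabeled_x, dtype=float)
    L = np.asarray(labeled_x, dtype=float)
    # Pairwise differences with shape (len(U), len(L), n_features).
    diffs = U[:, None, :] - L[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    return dists.min(axis=1)

# Hypothetical 2-d data.
labeled = [[0.0, 0.0], [4.0, 0.0]]
unlabeled = [[1.0, 0.0], [2.0, 0.0], [4.0, 3.0]]
scores = minmax_scores(unlabeled, labeled)
chosen = int(np.argmax(scores))  # the point farthest from its nearest labeled neighbor
```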

class yall.querystrategies.Density(metric='euclidean')[source]¶

Bases: yall.querystrategies.QueryStrategy

Finds the example x in U that has the greatest average distance to every other point in U.

\(x^* = argmin_x \frac{1}{|U|} \sum_{u=1}^{|U|} \frac{1}{1 + dist(x, x_u)}\)

Parameters:	metric (str) – Distance metric to use. See spd.cdist doc for available metrics.

choose(scores)[source]¶

Returns the example with the lowest similarity to the average x in U.

Parameters:	scores (numpy.ndarray) – Output of self.score()
Returns:	Index of chosen example.
Return type:	int

score(*args)[source]¶

Computes average distance between each member of U and each other member of U.

Returns:	Average distances from each point in U to each other point.
Return type:	numpy.ndarray
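As a rough sketch of the Density formula, the function below averages the similarity \(\frac{1}{1 + dist}\) of each point in U to all points in U. It is hypothetical and Euclidean only, and it includes each point's similarity to itself, which adds the same constant to every score and so does not change the ranking.

```python
import numpy as np

def density_scores(unlabeled_x):
    """Mean similarity 1 / (1 + dist) of each point in U to all points in U."""
    U = np.asarray(unlabeled_x, dtype=float)
    diffs = U[:, None, :] - U[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)  # pairwise Euclidean distances
    return (1.0 / (1.0 + dists)).mean(axis=1)

# Hypothetical data: three clustered points and one outlier.
unlabeled = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]]
scores = density_scores(unlabeled)
chosen = int(np.argmin(scores))  # lowest average similarity
```

In this toy data the outlier at (10, 10) is selected, which matches the description of picking the point with the greatest average distance to the rest of U.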