Y’ALL: Yet another Active Learning Library

Prerequisites

Installation

Clone or download this repository and run:

python setup.py install

A motivating example

Active learning can often discover a subset of the full data set that generalizes well to the test set. For example, we consider the Iris data set:

>>> import numpy as np
>>> from yall import ActiveLearningModel
>>> from yall.querystrategies import Margin
>>> from yall.utils import plot_learning_curve
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.linear_model import LogisticRegression as LR
>>> from sklearn.base import clone
>>> np.random.seed(0)
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2)
>>> lr = LR(solver="liblinear", multi_class="auto")
>>> lr = lr.fit(train_X, train_y)
>>> print(lr.score(test_X, test_y))
0.967

Using the full data set, logistic regression acheives an accuracy of 0.967 on the test data.

>>> alm = ActiveLearningModel(clone(lr), Margin(),
...                           eval_metric="accuracy",
...                           U_proportion=0.95, random_state=0)
>>> accuracies, choices = alm.run(train_X, test_X, train_y, test_y)
>>> plot_learning_curve(accuracies, 0, len(accuracies),
...                     eval_metric="accuracy")
_images/learning_curve.png

From the learning curve we see that only the first 25 or so data points are required to acheive perfect 1.0 accuracy on the test data.

>>> lr_small = clone(lr)
>>> lr_small = lr_small.fit(alm.L.X[:25, ], alm.L.y[:25])
>>> print(lr_small.score(test_X, test_y))
1.0

Supported query strategies

  • Random Sampling (passive learning)
  • Uncertainty Sampling
    • Entropy Sampling
    • Least Confidence
    • Least Confidence with Bias
    • Least Confidence with Dynamic Bias
    • Margin Sampling
    • Simple Margin Sampling
  • Representative Sampling
    • Density Sampling
    • Distance to Center
    • MinMax Sampling
  • Combined Sampling
    • Beta-weighted Combined Sampling
    • Lambda-weighted Combined Sampling

Running Tests

First install pytest-cov

Then, from the project home directory run

py.test --cov=yall tests

Authors

License

This project is licensed under the MIT License. See LICENSE for details.

Acknowledgements

This project grew out of a study of active learning methods for biomedical text classification. The paper associated with this study can be found at https://doi.org/10.1093/jamiaopen/ooy021