
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/impute/plot_iterative_imputer_variants_comparison.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_auto_examples_impute_plot_iterative_imputer_variants_comparison.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_impute_plot_iterative_imputer_variants_comparison.py:


=========================================================
Imputing missing values with variants of IterativeImputer
=========================================================

The :class:`sklearn.impute.IterativeImputer` class is very flexible - it can be
used with a variety of estimators to do round-robin regression, treating every
variable as an output in turn.

In this example we compare some estimators for the purpose of missing feature
imputation with :class:`sklearn.impute.IterativeImputer`:

* :class:`~sklearn.linear_model.BayesianRidge`: regularized linear regression
* :class:`~sklearn.tree.DecisionTreeRegressor`: non-linear regression
* :class:`~sklearn.ensemble.ExtraTreesRegressor`: similar to missForest in R
* :class:`~sklearn.neighbors.KNeighborsRegressor`: comparable to other KNN
  imputation approaches

Of particular interest is the ability of
:class:`sklearn.impute.IterativeImputer` to mimic the behavior of missForest, a
popular imputation package for R. In this example, we have chosen to use
:class:`sklearn.ensemble.ExtraTreesRegressor` instead of
:class:`sklearn.ensemble.RandomForestRegressor` (as in missForest) due to its
increased speed.

Note that :class:`sklearn.neighbors.KNeighborsRegressor` is different from KNN
imputation, which learns from samples with missing values by using a distance
metric that accounts for missing values, rather than imputing them.

The goal is to compare different estimators to see which one is best for the
:class:`sklearn.impute.IterativeImputer` when using a
:class:`sklearn.linear_model.BayesianRidge` estimator on the California housing
dataset with a single value randomly removed from each row.

For this particular pattern of missing values we see that
:class:`sklearn.ensemble.ExtraTreesRegressor` and
:class:`sklearn.linear_model.BayesianRidge` give the best results.

.. GENERATED FROM PYTHON SOURCE LINES 39-133


.. rst-class:: sphx-glr-script-out

.. code-block:: pytb

    Traceback (most recent call last):
      File "/build/scikit-learn-ZSX7SD/scikit-learn-0.23.2/examples/impute/plot_iterative_imputer_variants_comparison.py", line 61, in <module>
        X_full, y_full = fetch_california_housing(return_X_y=True)
      File "/build/scikit-learn-ZSX7SD/scikit-learn-0.23.2/.pybuild/cpython3_3.10/build/sklearn/utils/validation.py", line 72, in inner_f
        return f(**kwargs)
      File "/build/scikit-learn-ZSX7SD/scikit-learn-0.23.2/.pybuild/cpython3_3.10/build/sklearn/datasets/_california_housing.py", line 135, in fetch_california_housing
        archive_path = _fetch_remote(ARCHIVE, dirname=data_home)
      File "/build/scikit-learn-ZSX7SD/scikit-learn-0.23.2/.pybuild/cpython3_3.10/build/sklearn/datasets/_base.py", line 1181, in _fetch_remote
        urlretrieve(remote.url, file_path)
      File "/usr/lib/python3.10/urllib/request.py", line 241, in urlretrieve
        with contextlib.closing(urlopen(url, data)) as fp:
      File "/usr/lib/python3.10/urllib/request.py", line 216, in urlopen
        return opener.open(url, data, timeout)
      File "/usr/lib/python3.10/urllib/request.py", line 519, in open
        response = self._open(req, data)
      File "/usr/lib/python3.10/urllib/request.py", line 536, in _open
        result = self._call_chain(self.handle_open, protocol, protocol +
      File "/usr/lib/python3.10/urllib/request.py", line 496, in _call_chain
        result = func(*args)
      File "/usr/lib/python3.10/urllib/request.py", line 1391, in https_open
        return self.do_open(http.client.HTTPSConnection, req,
      File "/usr/lib/python3.10/urllib/request.py", line 1351, in do_open
        raise URLError(err)
    urllib.error.URLError: <urlopen error [Errno -2] Name or service not known>






|

.. code-block:: default

    print(__doc__)

    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd

    # To use this experimental feature, we need to explicitly ask for it:
    from sklearn.experimental import enable_iterative_imputer  # noqa
    from sklearn.datasets import fetch_california_housing
    from sklearn.impute import SimpleImputer
    from sklearn.impute import IterativeImputer
    from sklearn.linear_model import BayesianRidge
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.ensemble import ExtraTreesRegressor
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    N_SPLITS = 5

    rng = np.random.RandomState(0)

    X_full, y_full = fetch_california_housing(return_X_y=True)
    # ~2k samples is enough for the purpose of the example.
    # Remove the following two lines for a slower run with different error bars.
    X_full = X_full[::10]
    y_full = y_full[::10]
    n_samples, n_features = X_full.shape

    # Estimate the score on the entire dataset, with no missing values
    br_estimator = BayesianRidge()
    score_full_data = pd.DataFrame(
        cross_val_score(
            br_estimator, X_full, y_full, scoring='neg_mean_squared_error',
            cv=N_SPLITS
        ),
        columns=['Full Data']
    )

    # Add a single missing value to each row
    X_missing = X_full.copy()
    y_missing = y_full
    missing_samples = np.arange(n_samples)
    missing_features = rng.choice(n_features, n_samples, replace=True)
    X_missing[missing_samples, missing_features] = np.nan

    # Estimate the score after imputation (mean and median strategies)
    score_simple_imputer = pd.DataFrame()
    for strategy in ('mean', 'median'):
        estimator = make_pipeline(
            SimpleImputer(missing_values=np.nan, strategy=strategy),
            br_estimator
        )
        score_simple_imputer[strategy] = cross_val_score(
            estimator, X_missing, y_missing, scoring='neg_mean_squared_error',
            cv=N_SPLITS
        )

    # Estimate the score after iterative imputation of the missing values
    # with different estimators
    estimators = [
        BayesianRidge(),
        DecisionTreeRegressor(max_features='sqrt', random_state=0),
        ExtraTreesRegressor(n_estimators=10, random_state=0),
        KNeighborsRegressor(n_neighbors=15)
    ]
    score_iterative_imputer = pd.DataFrame()
    for impute_estimator in estimators:
        estimator = make_pipeline(
            IterativeImputer(random_state=0, estimator=impute_estimator),
            br_estimator
        )
        score_iterative_imputer[impute_estimator.__class__.__name__] = \
            cross_val_score(
                estimator, X_missing, y_missing, scoring='neg_mean_squared_error',
                cv=N_SPLITS
            )

    scores = pd.concat(
        [score_full_data, score_simple_imputer, score_iterative_imputer],
        keys=['Original', 'SimpleImputer', 'IterativeImputer'], axis=1
    )

    # plot california housing results
    fig, ax = plt.subplots(figsize=(13, 6))
    means = -scores.mean()
    errors = scores.std()
    means.plot.barh(xerr=errors, ax=ax)
    ax.set_title('California Housing Regression with Different Imputation Methods')
    ax.set_xlabel('MSE (smaller is better)')
    ax.set_yticks(np.arange(means.shape[0]))
    ax.set_yticklabels([" w/ ".join(label) for label in means.index.tolist()])
    plt.tight_layout(pad=1)
    plt.show()


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes  0.016 seconds)


.. _sphx_glr_download_auto_examples_impute_plot_iterative_imputer_variants_comparison.py:


.. only :: html

 .. container:: sphx-glr-footer
    :class: sphx-glr-footer-example



  .. container:: sphx-glr-download sphx-glr-download-python

     :download:`Download Python source code: plot_iterative_imputer_variants_comparison.py <plot_iterative_imputer_variants_comparison.py>`



  .. container:: sphx-glr-download sphx-glr-download-jupyter

     :download:`Download Jupyter notebook: plot_iterative_imputer_variants_comparison.ipynb <plot_iterative_imputer_variants_comparison.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_
