
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/compose/plot_column_transformer_mixed_types.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_auto_examples_compose_plot_column_transformer_mixed_types.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_compose_plot_column_transformer_mixed_types.py:


===================================
Column Transformer with Mixed Types
===================================

This example illustrates how to apply different preprocessing and feature
extraction pipelines to different subsets of features, using
:class:`sklearn.compose.ColumnTransformer`. This is particularly handy for the
case of datasets that contain heterogeneous data types, since we may want to
scale the numeric features and one-hot encode the categorical ones.

In this example, the numeric data is standard-scaled after mean-imputation,
while the categorical data is one-hot encoded after imputing missing values
with a new category (``'missing'``).

In addition, we show two different ways to dispatch the columns to the
particular pre-processor: by column names and by column data types.

Finally, the preprocessing pipeline is integrated in a full prediction pipeline
using :class:`sklearn.pipeline.Pipeline`, together with a simple classification
model.

.. GENERATED FROM PYTHON SOURCE LINES 23-47

.. code-block:: default


    # Author: Pedro Morales <part.morales@gmail.com>
    #
    # License: BSD 3 clause

    import numpy as np

    from sklearn.compose import ColumnTransformer
    from sklearn.datasets import fetch_openml
    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split, GridSearchCV

    np.random.seed(0)

    # Load data from https://www.openml.org/d/40945
    X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

    # Alternatively X and y can be obtained directly from the frame attribute:
    # X = titanic.frame.drop('survived', axis=1)
    # y = titanic.frame['survived']



.. rst-class:: sphx-glr-script-out

.. code-block:: pytb

    Traceback (most recent call last):
      File "/build/scikit-learn-ZSX7SD/scikit-learn-0.23.2/examples/compose/plot_column_transformer_mixed_types.py", line 41, in <module>
        X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
      File "/build/scikit-learn-ZSX7SD/scikit-learn-0.23.2/.pybuild/cpython3_3.10/build/sklearn/utils/validation.py", line 72, in inner_f
        return f(**kwargs)
      File "/build/scikit-learn-ZSX7SD/scikit-learn-0.23.2/.pybuild/cpython3_3.10/build/sklearn/datasets/_openml.py", line 738, in fetch_openml
        data_info = _get_data_info_by_name(name, version, data_home)
      File "/build/scikit-learn-ZSX7SD/scikit-learn-0.23.2/.pybuild/cpython3_3.10/build/sklearn/datasets/_openml.py", line 381, in _get_data_info_by_name
        json_data = _get_json_content_from_openml_api(url, None, False,
      File "/build/scikit-learn-ZSX7SD/scikit-learn-0.23.2/.pybuild/cpython3_3.10/build/sklearn/datasets/_openml.py", line 161, in _get_json_content_from_openml_api
        return _load_json()
      File "/build/scikit-learn-ZSX7SD/scikit-learn-0.23.2/.pybuild/cpython3_3.10/build/sklearn/datasets/_openml.py", line 61, in wrapper
        return f(*args, **kw)
      File "/build/scikit-learn-ZSX7SD/scikit-learn-0.23.2/.pybuild/cpython3_3.10/build/sklearn/datasets/_openml.py", line 157, in _load_json
        with closing(_open_openml_url(url, data_home)) as response:
      File "/build/scikit-learn-ZSX7SD/scikit-learn-0.23.2/.pybuild/cpython3_3.10/build/sklearn/datasets/_openml.py", line 106, in _open_openml_url
        with closing(urlopen(req)) as fsrc:
      File "/usr/lib/python3.10/urllib/request.py", line 216, in urlopen
        return opener.open(url, data, timeout)
      File "/usr/lib/python3.10/urllib/request.py", line 519, in open
        response = self._open(req, data)
      File "/usr/lib/python3.10/urllib/request.py", line 536, in _open
        result = self._call_chain(self.handle_open, protocol, protocol +
      File "/usr/lib/python3.10/urllib/request.py", line 496, in _call_chain
        result = func(*args)
      File "/usr/lib/python3.10/urllib/request.py", line 1391, in https_open
        return self.do_open(http.client.HTTPSConnection, req,
      File "/usr/lib/python3.10/urllib/request.py", line 1351, in do_open
        raise URLError(err)
    urllib.error.URLError: <urlopen error [Errno -2] Name or service not known>




.. GENERATED FROM PYTHON SOURCE LINES 48-64

Use ``ColumnTransformer`` by selecting column by names
##############################################################################
 We will train our classifier with the following features:

 Numeric Features:

 * ``age``: float;
 * ``fare``: float.

 Categorical Features:

 * ``embarked``: categories encoded as strings ``{'C', 'S', 'Q'}``;
 * ``sex``: categories encoded as strings ``{'female', 'male'}``;
 * ``pclass``: ordinal integers ``{1, 2, 3}``.

 We create the preprocessing pipelines for both numeric and categorical data.

.. GENERATED FROM PYTHON SOURCE LINES 64-90

.. code-block:: default


    numeric_features = ['age', 'fare']
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())])

    categorical_features = ['embarked', 'sex', 'pclass']
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))])

    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)])

    # Append classifier to preprocessing pipeline.
    # Now we have a full prediction pipeline.
    clf = Pipeline(steps=[('preprocessor', preprocessor),
                          ('classifier', LogisticRegression())])

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    clf.fit(X_train, y_train)
    print("model score: %.3f" % clf.score(X_test, y_test))


.. GENERATED FROM PYTHON SOURCE LINES 91-95

HTML representation of ``Pipeline``
##############################################################################
 When the ``Pipeline`` is printed out in a jupyter notebook an HTML
 representation of the estimator is displayed as follows:

.. GENERATED FROM PYTHON SOURCE LINES 95-99

.. code-block:: default

    from sklearn import set_config
    set_config(display='diagram')
    clf


.. GENERATED FROM PYTHON SOURCE LINES 100-108

Use ``ColumnTransformer`` by selecting column by data types
##############################################################################
 When dealing with a cleaned dataset, the preprocessing can be automatic by
 using the data types of the column to decide whether to treat a column as a
 numerical or categorical feature.
 :func:`sklearn.compose.make_column_selector` gives this possibility.
 First, let's only select a subset of columns to simplify our
 example.

.. GENERATED FROM PYTHON SOURCE LINES 108-112

.. code-block:: default


    subset_feature = ['embarked', 'sex', 'pclass', 'age', 'fare']
    X = X[subset_feature]


.. GENERATED FROM PYTHON SOURCE LINES 113-114

Then, we introspect the information regarding each column data type.

.. GENERATED FROM PYTHON SOURCE LINES 114-117

.. code-block:: default


    X.info()


.. GENERATED FROM PYTHON SOURCE LINES 118-123

We can observe that the `embarked` and `sex` columns were tagged as
`category` columns when loading the data with ``fetch_openml``. Therefore, we
can use this information to dispatch the categorical columns to the
``categorical_transformer`` and the remaining columns to the
``numerical_transformer``.

.. GENERATED FROM PYTHON SOURCE LINES 125-130

.. note:: In practice, you will have to handle yourself the column data type.
   If you want some columns to be considered as `category`, you will have to
   convert them into categorical columns. If you are using pandas, you can
   refer to their documentation regarding `Categorical data
   <https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html>`_.

.. GENERATED FROM PYTHON SOURCE LINES 130-144

.. code-block:: default


    from sklearn.compose import make_column_selector as selector

    preprocessor = ColumnTransformer(transformers=[
        ('num', numeric_transformer, selector(dtype_exclude="category")),
        ('cat', categorical_transformer, selector(dtype_include="category"))
    ])

    # Reproduce the identical fit/score process
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    clf.fit(X_train, y_train)
    print("model score: %.3f" % clf.score(X_test, y_test))


.. GENERATED FROM PYTHON SOURCE LINES 145-153

Using the prediction pipeline in a grid search
##############################################################################
 Grid search can also be performed on the different preprocessing steps
 defined in the ``ColumnTransformer`` object, together with the classifier's
 hyperparameters as part of the ``Pipeline``.
 We will search for both the imputer strategy of the numeric preprocessing
 and the regularization parameter of the logistic regression using
 :class:`sklearn.model_selection.GridSearchCV`.

.. GENERATED FROM PYTHON SOURCE LINES 153-164

.. code-block:: default


    param_grid = {
        'preprocessor__num__imputer__strategy': ['mean', 'median'],
        'classifier__C': [0.1, 1.0, 10, 100],
    }

    grid_search = GridSearchCV(clf, param_grid, cv=10)
    grid_search.fit(X_train, y_train)

    print(("best logistic regression from grid search: %.3f"
           % grid_search.score(X_test, y_test)))


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes  0.027 seconds)


.. _sphx_glr_download_auto_examples_compose_plot_column_transformer_mixed_types.py:


.. only :: html

 .. container:: sphx-glr-footer
    :class: sphx-glr-footer-example



  .. container:: sphx-glr-download sphx-glr-download-python

     :download:`Download Python source code: plot_column_transformer_mixed_types.py <plot_column_transformer_mixed_types.py>`



  .. container:: sphx-glr-download sphx-glr-download-jupyter

     :download:`Download Jupyter notebook: plot_column_transformer_mixed_types.ipynb <plot_column_transformer_mixed_types.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_
