Coding categorical data

Patsy allows great flexibility in how categorical data is coded, via the function C(). C() marks some data as being categorical (including data which would not automatically be treated as categorical, such as a column of integers), while also optionally setting the preferred coding scheme and level ordering.

Let’s get some categorical data to work with:

In [1]: from patsy import dmatrix, demo_data, ContrastMatrix, Poly

In [2]: data = demo_data("a", nlevels=3)

In [3]: data
 Out[3]: {'a': ['a1', 'a2', 'a3', 'a1', 'a2', 'a3']}

As you know, simply giving Patsy a categorical variable causes it to be coded using the default Treatment coding scheme. (Strings and booleans are treated as categorical by default.)

In [1]: dmatrix("a", data)
 Out[1]: 
DesignMatrix with shape (6, 3)
  Intercept  a[T.a2]  a[T.a3]
          1        0        0
          1        1        0
          1        0        1
          1        0        0
          1        1        0
          1        0        1
  Terms:
    'Intercept' (column 0)
    'a' (columns 1:3)
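To see why C() matters for data that is not categorical by default, here is a minimal sketch using a hypothetical integer column g (the name and values are assumptions for illustration, not part of the demo data). Without C() the integers should be treated as a single numeric column; with C() they are coded as levels.

```python
from patsy import dmatrix

# Hypothetical data: an integer column that really encodes group membership.
data = {"g": [1, 2, 3, 1, 2, 3]}

# Integers are treated as numeric by default: one column plus the intercept.
numeric = dmatrix("g", data)

# C() marks the same column as categorical: two treatment columns plus the intercept.
categorical = dmatrix("C(g)", data)

print(numeric.design_info.column_names)
print(categorical.design_info.column_names)
```

Comparing the two shapes, (6, 2) versus (6, 3), is a quick way to confirm which interpretation Patsy used.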

We can also alter the level ordering, which is useful for, e.g., Diff coding:

In [2]: l = ["a3", "a2", "a1"]

In [3]: dmatrix("C(a, levels=l)", data)
 Out[3]: 
DesignMatrix with shape (6, 3)
  Intercept  C(a, levels=l)[T.a2]  C(a, levels=l)[T.a1]
          1                     0                     1
          1                     1                     0
          1                     0                     0
          1                     0                     1
          1                     1                     0
          1                     0                     0
  Terms:
    'Intercept' (column 0)
    'C(a, levels=l)' (columns 1:3)

But the default coding is just that – a default. The easiest alternative is to use one of the other built-in coding schemes, like orthogonal polynomial coding:

In [4]: dmatrix("C(a, Poly)", data)
 Out[4]: 
DesignMatrix with shape (6, 3)
  Intercept  C(a, Poly).Linear  C(a, Poly).Quadratic
          1           -0.70711               0.40825
          1           -0.00000              -0.81650
          1            0.70711               0.40825
          1           -0.70711               0.40825
          1           -0.00000              -0.81650
          1            0.70711               0.40825
  Terms:
    'Intercept' (column 0)
    'C(a, Poly)' (columns 1:3)
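The "orthogonal" in orthogonal polynomial coding can be checked numerically. The following sketch (a sanity check, not part of Patsy's API) takes the three distinct rows of the Poly columns shown above and verifies that the columns are orthonormal and sum to zero.

```python
import numpy as np
from patsy import dmatrix

data = {"a": ["a1", "a2", "a3", "a1", "a2", "a3"]}

design = np.asarray(dmatrix("C(a, Poly)", data))

# The first three rows cover levels a1, a2, a3; drop the intercept column.
poly_cols = design[:3, 1:]

# For orthonormal columns, the Gram matrix is the identity.
gram = poly_cols.T.dot(poly_cols)
print(np.round(gram, 5))

# Each column also sums to zero, i.e. it is orthogonal to the intercept.
print(np.round(poly_cols.sum(axis=0), 5))
```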

There are a number of built-in coding schemes; for details you can check the API reference. But we aren’t restricted to those. We can also provide a custom contrast matrix, which allows us to produce all kinds of strange designs:

In [5]: contrast = [[1, 2], [3, 4], [5, 6]]

In [6]: dmatrix("C(a, contrast)", data)
 Out[6]: 
DesignMatrix with shape (6, 3)
  Intercept  C(a, contrast)[custom0]  C(a, contrast)[custom1]
          1                        1                        2
          1                        3                        4
          1                        5                        6
          1                        1                        2
          1                        3                        4
          1                        5                        6
  Terms:
    'Intercept' (column 0)
    'C(a, contrast)' (columns 1:3)

In [7]: dmatrix("C(a, [[1], [2], [-4]])", data)
 Out[7]: 
DesignMatrix with shape (6, 2)
  Intercept  C(a, [[1], [2], [-4]])[custom0]
          1                                1
          1                                2
          1                               -4
          1                                1
          1                                2
          1                               -4
  Terms:
    'Intercept' (column 0)
    'C(a, [[1], [2], [-4]])' (column 1)
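One way to read these outputs: each row of the contrast matrix belongs to one level, in level order, and every observation simply receives the row for its level. The sketch below reproduces that lookup by hand and compares it with what dmatrix returns. The levels list is an assumption matching the default ordering of this demo data.

```python
import numpy as np
from patsy import dmatrix

data = {"a": ["a1", "a2", "a3", "a1", "a2", "a3"]}
contrast = [[1, 2], [3, 4], [5, 6]]

# Assumed level order: the sorted unique values of data["a"].
levels = ["a1", "a2", "a3"]

design = np.asarray(dmatrix("C(a, contrast)", data))

# Look up each observation's row in the contrast matrix by hand.
expected_rows = []
for value in data["a"]:
    expected_rows.append(contrast[levels.index(value)])
expected = np.array(expected_rows)

# Column 0 is the intercept; the remaining columns come from the contrast.
print(np.array_equal(design[:, 1:], expected))
```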

Hmm, those [custom0], [custom1] names that Patsy auto-generated for us are a bit ugly looking. We can attach names to our contrast matrix by creating a ContrastMatrix object, and make things prettier:

In [8]: contrast_mat = ContrastMatrix(contrast, ["[pretty0]", "[pretty1]"])

In [9]: dmatrix("C(a, contrast_mat)", data)
 Out[9]: 
DesignMatrix with shape (6, 3)
  Intercept  C(a, contrast_mat)[pretty0]  C(a, contrast_mat)[pretty1]
          1                            1                            2
          1                            3                            4
          1                            5                            6
          1                            1                            2
          1                            3                            4
          1                            5                            6
  Terms:
    'Intercept' (column 0)
    'C(a, contrast_mat)' (columns 1:3)
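If the column names are needed programmatically rather than just for display, they can be read from the design matrix's design_info. A minimal sketch, with the data and contrast written out literally so it runs on its own:

```python
from patsy import dmatrix, ContrastMatrix

data = {"a": ["a1", "a2", "a3", "a1", "a2", "a3"]}

# One suffix per column of the contrast matrix.
contrast_mat = ContrastMatrix([[1, 2], [3, 4], [5, 6]],
                              ["[pretty0]", "[pretty1]"])

design = dmatrix("C(a, contrast_mat)", data)

# Each name is the factor's code followed by the suffix we supplied.
names = design.design_info.column_names
print(names)
```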

And, finally, if we want to get really fancy, we can also define our own “smart” coding schemes like Poly. Just define a class that has two methods, code_with_intercept() and code_without_intercept(). They have identical signatures, taking a list of levels as their argument and returning a ContrastMatrix. Patsy will automatically choose the appropriate method to call to produce a full-rank design matrix without redundancy; see Redundancy and categorical factors for the full details on how Patsy makes this decision.

As an example, here’s a simplified version of the built-in Treatment coding object:

import numpy as np
from patsy import ContrastMatrix

class MyTreat(object):
    def __init__(self, reference=0):
        # Index of the level to use as the reference (baseline) level.
        self.reference = reference

    def code_with_intercept(self, levels):
        # Full-rank coding: one indicator column per level.
        return ContrastMatrix(np.eye(len(levels)),
                              ["[My.%s]" % (level,) for level in levels])

    def code_without_intercept(self, levels):
        # Reduced-rank coding: the reference level gets an all-zero row,
        # and every other level gets its own indicator column.
        eye = np.eye(len(levels) - 1)
        contrasts = np.vstack((eye[:self.reference, :],
                               np.zeros((1, len(levels) - 1)),
                               eye[self.reference:, :]))
        suffixes = ["[MyT.%s]" % (level,) for level in
                    levels[:self.reference] + levels[self.reference + 1:]]
        return ContrastMatrix(contrasts, suffixes)

And it can now be used just like the built-in methods:

# Full rank:
In [11]: dmatrix("0 + C(a, MyTreat)", data)
 Out[11]: 
DesignMatrix with shape (6, 3)
  C(a, MyTreat)[My.a1]  C(a, MyTreat)[My.a2]  C(a, MyTreat)[My.a3]
                     1                     0                     0
                     0                     1                     0
                     0                     0                     1
                     1                     0                     0
                     0                     1                     0
                     0                     0                     1
  Terms:
    'C(a, MyTreat)' (columns 0:3)

# Reduced rank:
In [12]: dmatrix("C(a, MyTreat)", data)
 Out[12]: 
DesignMatrix with shape (6, 3)
  Intercept  C(a, MyTreat)[MyT.a2]  C(a, MyTreat)[MyT.a3]
          1                      0                      0
          1                      1                      0
          1                      0                      1
          1                      0                      0
          1                      1                      0
          1                      0                      1
  Terms:
    'Intercept' (column 0)
    'C(a, MyTreat)' (columns 1:3)

# With argument:
In [13]: dmatrix("C(a, MyTreat(2))", data)
 Out[13]: 
DesignMatrix with shape (6, 3)
  Intercept  C(a, MyTreat(2))[MyT.a1]  C(a, MyTreat(2))[MyT.a2]
          1                         1                         0
          1                         0                         1
          1                         0                         0
          1                         1                         0
          1                         0                         1
          1                         0                         0
  Terms:
    'Intercept' (column 0)
    'C(a, MyTreat(2))' (columns 1:3)
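Patsy looks up names used in a formula, such as MyTreat, in the environment of the code that calls dmatrix, so the class has to be defined in that same scope before the formula is evaluated. The self-contained sketch below defines the class and the data together and inspects the column names for the three calls. It is a check under those assumptions rather than a reference implementation.

```python
import numpy as np
from patsy import dmatrix, ContrastMatrix

class MyTreat(object):
    def __init__(self, reference=0):
        self.reference = reference

    def code_with_intercept(self, levels):
        # One indicator column per level.
        return ContrastMatrix(np.eye(len(levels)),
                              ["[My.%s]" % (level,) for level in levels])

    def code_without_intercept(self, levels):
        # All-zero row for the reference level, indicators for the rest.
        eye = np.eye(len(levels) - 1)
        contrasts = np.vstack((eye[:self.reference, :],
                               np.zeros((1, len(levels) - 1)),
                               eye[self.reference:, :]))
        suffixes = ["[MyT.%s]" % (level,) for level in
                    levels[:self.reference] + levels[self.reference + 1:]]
        return ContrastMatrix(contrasts, suffixes)

data = {"a": ["a1", "a2", "a3", "a1", "a2", "a3"]}

# No intercept in the formula, so Patsy needs the full-rank coding.
full_rank = dmatrix("0 + C(a, MyTreat)", data)

# Intercept present, so Patsy needs the reduced-rank coding.
reduced_rank = dmatrix("C(a, MyTreat)", data)

# Passing an instance lets us choose the reference level (here the third).
with_argument = dmatrix("C(a, MyTreat(2))", data)

print(full_rank.design_info.column_names)
print(reduced_rank.design_info.column_names)
print(with_argument.design_info.column_names)
```

The [My.…] versus [MyT.…] suffixes make it easy to tell which of the two methods Patsy called in each case.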
