I wrote the below code for a pipeline to process data in a dataframe. On execution I get this error:
ValueError: not enough values to unpack (expected 3, got 2)
I suspect that the error is caused by the FunctionTransformer, but I cannot figure out what the issue is. Can anyone help me find the error?
The code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler, OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
def funct(X):
    X.Age.replace({'<35':0,'<35':1},inplace=True)
    X.Accessibility.replace({'No':0,'Yes':1},inplace=True)
    X.MentalHealth.replace({'No':0,'Yes':1},inplace=True)
    X.MainBranch.replace({'NotDev':0,'Dev':1},inplace=True)
    X.YearsCode = np.sqrt(X.YearsCode)
    X.YearsCodePro = np.sqrt(X.YearsCodePro)
    X.PreviousSalary = np.sqrt(X.PreviousSalary)
    X.ComputerSkills = np.sqrt(X.ComputerSkills)
    X.Country = pd.util.hash_pandas_object(X.Country)
    X.HaveWorkedWith = pd.util.hash_pandas_object(X.HaveWorkedWith)
data = pd.read_csv('stackoverflow_full.csv')
data.info()
X = data.drop('Employed',axis=True)
y = data['Employed'].copy()
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)
X_train, X_eval, y_train, y_eval = train_test_split(X_train,y_train,test_size=0.01,random_state=42)
#Columns to process
all_cols = list(data.columns)
all_cols.remove('Unnamed: 0')
drop_cols = ['Unnamed: 0']
onehot_cols = ['Gender']
ordinal_cols = ['EdLevel']
impute_cols = ['HaveWorkedWith']
#Instantiate Transformers to process columns
func_transformer = FunctionTransformer(func= funct)
onehot_transformer = Pipeline(steps=[('onehot encode',OneHotEncoder(handle_unknown='ignore'))],verbose=True)
ordinal_transformer = Pipeline(steps=[('ordinal encode',OrdinalEncoder())],verbose=True)
impute_transformer = Pipeline(steps=[('imputing',SimpleImputer(strategy='most frequent'))],verbose=True)
scaling_transformer = Pipeline(steps=[('scaling',StandardScaler())],verbose=True)
preprocessing = ColumnTransformer(transformers=[('drop cols','drop',drop_cols),('funcT',func_transformer),('onehot',onehot_transformer,onehot_cols),('ordinal',ordinal_transformer,ordinal_cols),('impute',impute_transformer,impute_cols),('scale',scaling_transformer,all_cols)],verbose=True)
model = Pipeline(steps=[('preprocessing',preprocessing),('clustering',KMeans(n_clusters=2))],verbose=True)
model.fit_transform(X_train,y_train)
ValueError Traceback (most recent call last)
<ipython-input-45-20aa569525ea> in <cell line: 1>()
----> 1 model.fit_transform(X_train,y_train)
6 frames
/usr/local/lib/python3.10/dist-packages/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
435 """
436 fit_params_steps = self._check_fit_params(**fit_params)
--> 437 Xt = self._fit(X, y, **fit_params_steps)
439 last_step = self._final_estimator
/usr/local/lib/python3.10/dist-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params_steps)
357 cloned_transformer = clone(transformer)
358 # Fit or load from cache the current transformer
--> 359 X, fitted_transformer = fit_transform_one_cached(
360 cloned_transformer,
361 X,
/usr/local/lib/python3.10/dist-packages/joblib/memory.py in __call__(self, *args, **kwargs)
352 def __call__(self, *args, **kwargs):
--> 353 return self.func(*args, **kwargs)
355 def call_and_shelve(self, *args, **kwargs):
/usr/local/lib/python3.10/dist-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
891 with _print_elapsed_time(message_clsname, message):
892 if hasattr(transformer, "fit_transform"):
--> 893 res = transformer.fit_transform(X, y, **fit_params)
894 else:
895 res = transformer.fit(X, y, **fit_params).transform(X)
/usr/local/lib/python3.10/dist-packages/sklearn/utils/_set_output.py in wrapped(self, X, *args, **kwargs)
138 @wraps(f)
139 def wrapped(self, X, *args, **kwargs):
--> 140 data_to_wrap = f(self, X, *args, **kwargs)
141 if isinstance(data_to_wrap, tuple):
142 # only wrap the first output for cross decomposition
/usr/local/lib/python3.10/dist-packages/sklearn/compose/_column_transformer.py in fit_transform(self, X, y)
721 # set n_features_in_ attribute
722 self._check_n_features(X, reset=True)
--> 723 self._validate_transformers()
724 self._validate_column_callables(X)
725 self._validate_remainder(X)
/usr/local/lib/python3.10/dist-packages/sklearn/compose/_column_transformer.py in _validate_transformers(self)
396 return
--> 398 names, transformers, _ = zip(*self.transformers)
400 # validate names
ValueError: not enough values to unpack (expected 3, got 2)
Thanks - I may have a look later - don’t have much time now.
Btw, totally unrelated to your problem, but this line is a bit “jarring”:
X = data.drop('Employed',axis=True)
axis should be either the integer 1 or the string “columns” here, which makes the code nicer. In terms of styling, it’s also nicer to always put a space after a comma, which really does make it easier to read. (Linters will also frown upon your styling.)
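For illustration, here is a small sketch of the two spellings I would consider clearer. The toy dataframe is made up for the example and is not the real dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for stackoverflow_full.csv.
toy = pd.DataFrame({"Age": ["<35", ">35"], "Employed": [1, 0]})

# Option 1: the integer axis.
X_by_axis = toy.drop("Employed", axis=1)

# Option 2: the columns keyword, which states the intent directly.
X_by_keyword = toy.drop(columns="Employed")

print(list(X_by_axis.columns))
print(X_by_axis.equals(X_by_keyword))
```

axis=True happens to work only because True compares equal to 1, so either spelling above gives the same result while being easier to read.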
No problem. I don’t have direct experience with using these pipelines (just toyed with them quite a while ago). But this should not be too hard to debug for yourself. The stack trace unfortunately doesn’t tell in which transform the error is located, so it isn’t that much help. But to debug, have you tried the following?
- run everything after first commenting out use of the funct transform, just to make sure the mechanics of the rest are ok
- apply funct directly to the training data, to verify it works as expected
- double-check that a transforming function can act in place (?!), since your funct is not returning anything
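The last two checks can be sketched roughly as follows. The function names and the toy dataframe are hypothetical. The point is that a function which only modifies its argument in place returns None, while FunctionTransformer, as far as I understand it, hands back whatever the wrapped function returns.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def in_place(X):
    # Mutates X and implicitly returns None, like funct in the question.
    X["YearsCode"] = np.sqrt(X["YearsCode"])


def returning(X):
    # Works on a copy and returns the transformed data.
    X = X.copy()
    X["YearsCode"] = np.sqrt(X["YearsCode"])
    return X


# Hypothetical toy data with one of the columns from the question.
toy = pd.DataFrame({"YearsCode": [4.0, 9.0]})

# Check 1: call the function directly and look at what comes back.
direct_result = in_place(toy.copy())
print(direct_result)

# Check 2: wrap the returning variant and run it outside any pipeline.
transformed = FunctionTransformer(func=returning).fit_transform(toy)
print(transformed)
```

Running the function on its own like this separates "does my function work" from "did I wire it into the pipeline correctly".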
Yes, tested all of the above and it works.
Because the trace doesn’t tell me where in the pipeline the failure is I’m lost.
Is there a different way to debug in Python? I find the trace mostly obscure and insufficient.
I have often felt the same frustration. Still, the stack trace does give some vital, perhaps usable info.
One way to proceed is to step through the code in a Python debugger and identify which object is causing the error.
$ python -m pdb --help
If you’re not familiar with that, it may have a bit of a learning curve but is not too difficult to use (it has built-in help, and help on each of the built-in debug commands).
There is a pretty decent tutorial at Python Debugging With Pdb – Real Python
(In the distant past I’ve also used a GUI for Python debugging on Windows, but I’m not familiar with any current ones, so cannot give any hints for that. VSCode might make debugging easier.)
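As a rough, non-interactive sketch of the same idea: catch the exception, look at the innermost frame of the traceback, and optionally drop into pdb from there. The functions parse_row and load_rows are made-up stand-ins, not part of the code in the question.

```python
import traceback


def innermost_frame(exc):
    """Return (function name, line number) of the frame that raised exc."""
    frames = traceback.extract_tb(exc.__traceback__)
    last = frames[-1]
    return last.name, last.lineno


def parse_row(row):
    # Hypothetical helper that fails on non-numeric input.
    return int(row)


def load_rows(rows):
    return [parse_row(r) for r in rows]


try:
    load_rows(["1", "2", "oops"])
except ValueError as exc:
    where = innermost_frame(exc)
    print("raised in:", where[0])
    # To inspect local variables interactively at the failing frame, uncomment:
    # import pdb; pdb.post_mortem(exc.__traceback__)
```

In a notebook, the %debug magic run right after a failing cell should give a similar post-mortem session.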
Another way: consider line 398 in the trace, and consider the error message. There is a ValueError, expecting 3 objects (line 398 unpacks into 3), but only 2 objects are given. So this suggests something is wrong with self.transformers. That list or container needs to contain sub-lists (or tuples etc.) of exactly 3 items, but it only got 2 in at least one of the sub-lists.
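The unpacking failure can be reproduced without scikit-learn at all. This is a minimal sketch in which strings stand in for the real transformer objects, and find_malformed is a hypothetical helper I made up for locating the offending entry.

```python
def find_malformed(transformers, expected_len=3):
    """Return (index, name) for every entry that does not have expected_len items."""
    return [
        (i, entry[0])
        for i, entry in enumerate(transformers)
        if len(entry) != expected_len
    ]


# Strings stand in for the real transformer objects from the question.
transformers = [
    ("drop cols", "drop", ["Unnamed: 0"]),
    ("funcT", "func_transformer"),  # only 2 items: the columns are missing
    ("onehot", "onehot_transformer", ["Gender"]),
]

try:
    # Same pattern as line 398 in the trace.
    names, objs, _ = zip(*transformers)
except ValueError as exc:
    message = str(exc)
    print(message)

bad = find_malformed(transformers)
print(bad)
```

zip stops at the shortest tuple, so a single 2-item entry makes it yield only two groups, which is why the message says "expected 3, got 2" without naming the entry.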
Also, this is coming (apparently) from the ColumnTransformer. Ok, then, without knowing anything more about underlying code, consider the way you defined that…
First try out my earlier suggestion: Remove your custom transformer from the ColumnTransformer to verify that the rest is OK. (Something else might go wrong, but as long as it happens later, that’s fine.) If so, then consider if you passed that in correctly.
There are other, quick-and-dirty ways to debug, but I think the second one will let you proceed…
Also take another look at the docs for ColumnTransformer (notes about the transformers argument).
(Not knowing any of these APIs, I do wonder, can the custom function be a function defined on a whole dataframe?? Should it not be a function that is only defined on a scalar or series? The example in the FunctionTransformer doc also suggests this. The fit function of the class will work on (certain cols of) a dataframe, but that does not imply that the defining function should/can take a dataframe.)
Found the source of the error. I misunderstood the implementation of a Function Transformer. It wraps a function which gets applied to every value in the given dataframe columns. One passes the given columns to the transformer and it returns the new column values. In my implementation I assumed the underlying implementation provided unrestricted access to the entire dataframe - not true.
The second issue caused the error I saw. Each entry in the ColumnTransformer’s transformers list takes 3 items: 1. a name that identifies the step. 2. the transformer to be executed at this step. 3. the columns of the dataframe to which the transformer must be applied.
I did not specify this third parameter for the Function Transformer and this caused the error. Once I corrected this I got a second error which indicated that the Function Transformer is broken.
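For anyone landing here later, a minimal sketch of a 3-item entry for a FunctionTransformer inside a ColumnTransformer follows. The toy dataframe, the function sqrt_columns and the seen_columns list are all hypothetical. As far as I can tell, the wrapped function receives only the selected columns, as one block rather than value by value, and has to return the transformed data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

seen_columns = []  # records what the wrapped function actually receives


def sqrt_columns(X):
    # X holds only the columns named in the ColumnTransformer entry.
    seen_columns.append(list(X.columns))
    return np.sqrt(X)  # the result must be returned, not assigned in place


# Hypothetical toy data reusing a few column names from the question.
toy = pd.DataFrame(
    {
        "YearsCode": [4.0, 9.0, 16.0],
        "PreviousSalary": [100.0, 400.0, 900.0],
        "Gender": ["Man", "Woman", "Man"],
    }
)

preprocessing = ColumnTransformer(
    transformers=[
        # (name, transformer, columns): all three items are required
        ("sqrt", FunctionTransformer(func=sqrt_columns), ["YearsCode", "PreviousSalary"]),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["Gender"]),
    ]
)

result = preprocessing.fit_transform(toy)
print(result.shape)
print(seen_columns[-1])
```

Splitting the original funct into several small functions, one per group of columns, would fit this model better than one function that expects the whole dataframe.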
So, my question is answered and now I need to relook my approach. But I guess that’s all in a day’s work.
I will certainly have a look at pdb, thank you for the tip.