Help with error on Python Pipeline - Python Help - Discussions on Python.org

I wrote the below code for a pipeline to process data in a dataframe. On execution I get this error:
ValueError: not enough values to unpack (expected 3, got 2)
I suspect that the error is caused by the FunctionTransformer, but I cannot figure out what the issue is. Can anyone help me find the error?
The code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler, OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
def funct(X):
  X.Age.replace({'<35':0,'<35':1},inplace=True)
  X.Accessibility.replace({'No':0,'Yes':1},inplace=True)
  X.MentalHealth.replace({'No':0,'Yes':1},inplace=True)
  X.MainBranch.replace({'NotDev':0,'Dev':1},inplace=True)
  X.YearsCode = np.sqrt(X.YearsCode)
  X.YearsCodePro = np.sqrt(X.YearsCodePro)
  X.PreviousSalary = np.sqrt(X.PreviousSalary)
  X.ComputerSkills = np.sqrt(X.ComputerSkills)
  X.Country = pd.util.hash_pandas_object(X.Country)
  X.HaveWorkedWith = pd.util.hash_pandas_object(X.HaveWorkedWith)
data = pd.read_csv('stackoverflow_full.csv')
data.info()
X = data.drop('Employed',axis=True)
y = data['Employed'].copy()
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)
X_train, X_eval, y_train, y_eval = train_test_split(X_train,y_train,test_size=0.01,random_state=42)
#Columns to process
all_cols = list(data.columns)
all_cols.remove('Unnamed: 0')
drop_cols = ['Unnamed: 0']
onehot_cols = ['Gender']
ordinal_cols = ['EdLevel']
impute_cols = ['HaveWorkedWith']
#Instantiate Transformers to process columns
func_transformer = FunctionTransformer(func= funct)
onehot_transformer = Pipeline(steps=[('onehot encode',OneHotEncoder(handle_unknown='ignore'))],verbose=True)
ordinal_transformer = Pipeline(steps=[('ordinal encode',OrdinalEncoder())],verbose=True)
impute_transformer = Pipeline(steps=[('imputing',SimpleImputer(strategy='most frequent'))],verbose=True)
scaling_transformer = Pipeline(steps=[('scaling',StandardScaler())],verbose=True)
preprocessing = ColumnTransformer(transformers=[('drop cols','drop',drop_cols),('funcT',func_transformer),('onehot',onehot_transformer,onehot_cols),('ordinal',ordinal_transformer,ordinal_cols),('impute',impute_transformer,impute_cols),('scale',scaling_transformer,all_cols)],verbose=True)
model = Pipeline(steps=[('preprocessing',preprocessing),('clustering',KMeans(n_clusters=2))],verbose=True)
model.fit_transform(X_train,y_train)
ValueError                                Traceback (most recent call last)
<ipython-input-45-20aa569525ea> in <cell line: 1>()
----> 1 model.fit_transform(X_train,y_train)
6 frames
/usr/local/lib/python3.10/dist-packages/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
    435         """
    436         fit_params_steps = self._check_fit_params(**fit_params)
--> 437         Xt = self._fit(X, y, **fit_params_steps)
    439         last_step = self._final_estimator
/usr/local/lib/python3.10/dist-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params_steps)
    357                 cloned_transformer = clone(transformer)
    358             # Fit or load from cache the current transformer
--> 359             X, fitted_transformer = fit_transform_one_cached(
    360                 cloned_transformer,
    361                 X,
/usr/local/lib/python3.10/dist-packages/joblib/memory.py in __call__(self, *args, **kwargs)
    352     def __call__(self, *args, **kwargs):
--> 353         return self.func(*args, **kwargs)
    355     def call_and_shelve(self, *args, **kwargs):
/usr/local/lib/python3.10/dist-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
    891     with _print_elapsed_time(message_clsname, message):
    892         if hasattr(transformer, "fit_transform"):
--> 893             res = transformer.fit_transform(X, y, **fit_params)
    894         else:
    895             res = transformer.fit(X, y, **fit_params).transform(X)
/usr/local/lib/python3.10/dist-packages/sklearn/utils/_set_output.py in wrapped(self, X, *args, **kwargs)
    138     @wraps(f)
    139     def wrapped(self, X, *args, **kwargs):
--> 140         data_to_wrap = f(self, X, *args, **kwargs)
    141         if isinstance(data_to_wrap, tuple):
    142             # only wrap the first output for cross decomposition
/usr/local/lib/python3.10/dist-packages/sklearn/compose/_column_transformer.py in fit_transform(self, X, y)
    721         # set n_features_in_ attribute
    722         self._check_n_features(X, reset=True)
--> 723         self._validate_transformers()
    724         self._validate_column_callables(X)
    725         self._validate_remainder(X)
/usr/local/lib/python3.10/dist-packages/sklearn/compose/_column_transformer.py in _validate_transformers(self)
    396             return
--> 398         names, transformers, _ = zip(*self.transformers)
    400         # validate names
ValueError: not enough values to unpack (expected 3, got 2)
Thanks - I may have a look later - don’t have much time now.
Btw, totally unrelated to your problem, but this line is a bit “jarring”:
X = data.drop('Employed',axis=True)
axis should be either the integer 1 or the string “columns” here - makes the code nicer. In terms of styling, it’s also nicer to always put a space after a comma - which really does make it easier to read. (Linters will also frown upon your styling.)
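A small sketch of the spellings mentioned above, on a made-up two-column frame (the data here is hypothetical, not taken from stackoverflow_full.csv). All three calls should drop the same column; which one reads best is a matter of taste.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data.
df = pd.DataFrame({"Employed": [1, 0], "Age": ["<35", ">35"]})

by_int = df.drop("Employed", axis=1)           # axis given as the integer 1
by_name = df.drop("Employed", axis="columns")  # axis given as a string
by_kw = df.drop(columns="Employed")            # columns keyword, no axis needed
```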
No prob. I don’t have direct experience with using these pipelines (just toyed with them quite a while ago). But this should not be too hard to debug for yourself. The stack trace unfortunately doesn’t tell in which transform the error is located, so isn’t that much help. But to debug, have you tried to:
- run everything after first commenting out use of the funct transform - just to make sure the mechanics of the rest are ok
- apply funct directly to the training data, to verify it works as expected
- double-check that a transforming function can act in place (?!), since your funct is not returning anything
Yes, tested all of the above and it works.

Because the trace doesn’t tell me where in the pipeline the failure is, I’m lost.

Is there a different way to debug in Python? I find the trace mostly obscure and insufficient.
I have often felt the same frustration. Still, the stack trace does give some vital, perhaps usable info.
One way to proceed is to step through the code in a Python debugger and identify which object is causing the error.
$ python -m pdb --help
If you’re not familiar with that, it may have a bit of a learning curve but is not too difficult to use… (it has built-in help, including help on each of the built-in debug commands).

There is a pretty decent tutorial at Python Debugging With Pdb – Real Python

(In the distant past I’ve also used a GUI for Python debugging on Windows, but I’m not familiar with any current ones, so cannot give any hints for that. VSCode might make debugging easier.)
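For anyone who wants something lighter than a full debugger session, here is a minimal sketch using only the standard library traceback module to find the deepest frame of an exception. The helper name innermost_frame and the toy failing function are made up for illustration; the comment shows where pdb.post_mortem could be called instead for an interactive session.

```python
import traceback


def innermost_frame(exc):
    """Return (filename, lineno, function name, source line) of the deepest frame."""
    frame = traceback.extract_tb(exc.__traceback__)[-1]
    return frame.filename, frame.lineno, frame.name, frame.line


def failing():
    # Toy stand-in for the real failure: unpacking 2 values into 3 names.
    a, b, c = (1, 2)


try:
    failing()
except ValueError as exc:
    info = innermost_frame(exc)
    # For an interactive look at the same spot, one could instead run:
    #   import pdb; pdb.post_mortem(exc.__traceback__)
```

Here info[2] holds the name of the function in which the exception was raised, which is often enough to know where to look next.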
Another way: consider line 398 and the error message. There is a ValueError: line 398 expects to unpack 3 objects, but only 2 are given. So this suggests that something is wrong with self.transformers. That list or container needs to contain sub-lists (or tuples etc.) of exactly 3 items, but at least one of the sub-lists has only 2.
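The reasoning above can be reproduced in plain Python, without scikit-learn. This is only a sketch: the list below is a hypothetical stand-in for self.transformers, with strings in place of real transformer objects, and the second entry deliberately lacks its third item.

```python
# Hypothetical stand-in for self.transformers; the second entry has no columns.
transformers = [
    ("drop cols", "drop", ["Unnamed: 0"]),
    ("funcT", "some transformer"),
]

# zip stops at the shortest tuple, so it yields only two groups.
groups = list(zip(*transformers))

try:
    # Same unpacking pattern as line 398 in the traceback.
    names, steps, _ = zip(*transformers)
except ValueError as exc:
    message = str(exc)

# A quick check that names the malformed entries.
bad_entries = [entry[0] for entry in transformers if len(entry) != 3]
```

A check like bad_entries, run on the list before it is handed to ColumnTransformer, points straight at the entry that is too short.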

Also, this is coming (apparently) from the ColumnTransformer. Ok, then, without knowing anything more about underlying code, consider the way you defined that…

First try out my earlier suggestion: Remove your custom transformer from the ColumnTransformer to verify that the rest is OK. (Something else might go wrong, but as long as it happens later, that’s fine.) If so, then consider if you passed that in correctly.
There are other, quick-and-dirty ways to debug, but I think the second one will let you proceed…

Also take another look at the docs for ColumnTransformer (notes about the transformers argument).
(Not knowing any of these APIs, I do wonder, can the custom function be a function defined on a whole dataframe?? Should it not be a function that is only defined on a scalar or series? The example in the FunctionTransformer doc also suggests this. The fit function of the class will work on (certain cols of) a dataframe, but that does not imply that the defining function should/can take a dataframe.)
Found the source of the error. I misunderstood the implementation of a FunctionTransformer. It wraps a function which gets applied to the given dataframe columns. One passes the given columns to the transformer and it returns the new column values. In my implementation I assumed the underlying implementation provided unrestricted access to the entire dataframe - not true.
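A minimal sketch of how the wrapped function sees its input, assuming the default validate=False: the function is called once with the whole block of columns handed to the transformer, and whatever it returns becomes the output, so it has to return something. The function record_and_double and the toy frame are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

seen = []  # records what the wrapped function was called with


def record_and_double(X):
    # Called once with the whole input, not once per value.
    seen.append(type(X).__name__)
    return X * 2  # the return value is the transformer's output


df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
out = FunctionTransformer(func=record_and_double).fit_transform(df)
```

A function that modifies its argument in place and returns nothing would hand None to the next step, which may be worth keeping in mind for funct.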

The second issue caused the error I saw. To execute a transformer inside the ColumnTransformer, you pass it 3 parameters: 1. a descriptor of the step, which can be used to identify the step. 2. the transformer to be executed at this step. 3. the columns of the dataframe to which the transformer must be applied.
I did not specify this third parameter for the FunctionTransformer and this caused the error. Once I corrected this I got a second error, which indicated that the FunctionTransformer is broken.
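A sketch of what the three-item form might look like, under some assumptions: the toy frame below is hypothetical and only borrows a few column names from the question, and sqrt_columns is a made-up replacement for part of funct that returns its result instead of working in place.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

# Hypothetical toy data standing in for a few columns of the real file.
df = pd.DataFrame({
    "Gender": ["Man", "Woman", "Man", "NonBinary"],
    "YearsCode": [4.0, 9.0, 16.0, 25.0],
    "PreviousSalary": [100.0, 400.0, 900.0, 1600.0],
})


def sqrt_columns(X):
    # Receives the selected columns as one block and returns the new block.
    return np.sqrt(X)


preprocessing = ColumnTransformer(transformers=[
    # Every entry is (name, transformer, columns): three items, not two.
    ("sqrt", FunctionTransformer(func=sqrt_columns), ["YearsCode", "PreviousSalary"]),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["Gender"]),
])

result = preprocessing.fit_transform(df)
```

With two numeric columns and three distinct Gender values, result has four rows and five columns.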
So, my question is answered and now I need to rethink my approach. But I guess that’s all in a day’s work.
I will certainly have a look at pdb, thank you for the tip.