AttributeError: 'DataFrame' object has no attribute 'cat' when running lgb.train() · Issue #6015

link管理

链接快照平台

输入网页链接，自动生成快照
标签化管理网页链接

相关文章推荐

深沉的莴苣 · mongoose与时间相关的debug | ...· 3 月前 ·

痛苦的帽子 · 伊春市人民政府办公室关于印发伊春市森林康养旅 ...· 9 月前 ·

有情有义的匕首 · sosadfun废文网自由阅读安卓无弹窗下载 ...· 9 月前 ·

傻傻的豆浆 · Fastapi JWT 认证简单使用 - 知乎· 1 年前 ·

欢乐的小虾米 · 2023年3月份欧拉好猫销量3970台, ...· 1 年前 ·

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement . We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hello guys. It's my first time using LightGBM and I have been stcuk on this issue for a couple of days. I'm trying to replicate a solution of a Kaggle competition and this solution uses LightGBM, the solution is from 2020 and although I haven't copied it exactly because I was getting several errors and warning this may be related. I already tried with the cat_feature parametewrs but the error persists, I can share more details and the code if this can help.

Thanks a lot in advance!

Thanks for using LightGBM.

We'd be happy to help, but need more information than this to help you without significant guessing. Please try to provide a minimal, reproducible example ( docs link ), code we can run which is as small as possible and which replicates the error you're encountering.

That should also include the information that was asked for when you clicked "new issue" here but which you didn't provide, like:

version of LightGBM you're using

how you installed it

interface you're using (Python? R? the CLI?)

Please also include the exact text of the error, so others who are struggling can find this conversation from search engines.

Thanks for the explanation, I attach a zip with two files that I think are good (and short) enough to replicate the problem.

example.zip

Please include the code and examples here, in plaintext in a comment, so they can be found by search engines and so anyone reading the discussion can understand what you're asking about. I'm sorry, but I won't open a random zip file.

If you're not familiar with formatting text on GitHub, see https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax for some help.

LightGBM version: 4.0.0

Python version: 3.9.13

Command for LGBM installation: pip install lightgbm

All this was coded in a Python Notebook:

# Basic imports
import numpy as np
import pandas as pd
import os, sys, gc, time, warnings, pickle, psutil, random
import multiprocessing 
from multiprocessing import Pool
warnings.filterwarnings('ignore')
# Seeder
def seed_everything(seed=0):
    random.seed(seed)
    np.random.seed(seed)
# Model parameters
import lightgbm as lgb
lgb_params = {
                    'boosting_type': 'gbdt',
                    'objective': 'tweedie',
                    'tweedie_variance_power': 1.1,
                    'metric': 'rmse',
                    'subsample': 0.5,
                    'subsample_freq': 1,
                    'learning_rate': 0.015,
                    'num_leaves': 2**11-1,
                    'min_data_in_leaf': 2**12-1,
                    'feature_fraction': 0.5,
                    'max_bin': 100,
                    'n_estimators': 3000,
                    'boost_from_average': False,
                    'verbose': -1,
Don't pay much attention to all this variables
# Variables
STORE_IDS = ['CA_1','CA_2','CA_3','CA_4','TX_1','TX_2','TX_3','WI_1','WI_2','WI_3']
ver, KKK = 'priv', 0
VER = 1                          
SEED = 42                        
seed_everything(SEED)            
lgb_params['seed'] = SEED        
N_CORES = psutil.cpu_count()     
#LIMITS and const
TARGET      = 'sales'            
START_TRAIN = 0                  
END_TRAIN   = 1941 - 28*KKK      
P_HORIZON   = 28                 
USE_AUX     = False             
remove_features = ['id','state_id','store_id',
                   'date','wm_yr_wk','d',TARGET]
mean_features   = ['enc_cat_id_mean','enc_cat_id_std',
                   'enc_dept_id_mean','enc_dept_id_std',
                   'enc_item_id_mean','enc_item_id_std'] 
#SPLITS for lags creation
SHIFT_DAY  = 28
N_LAGS     = 15
LAGS_SPLIT = [col for col in range(SHIFT_DAY,SHIFT_DAY+N_LAGS)]
ROLS_SPLIT = []
for i in [1,7,14]:
    for j in [7,14,30,60]:
        ROLS_SPLIT.append([i,j])
#read grid_reduced
grid_reduced = pd.read_csv('grid_reduced.csv')
grid_reduced = grid_reduced.loc[:,~grid_reduced.columns.duplicated()] #delete duplicated columns
cat_features = []
for col in grid_reduced.columns: #proper tranformation of categorical features
  if grid_reduced[col].dtype.name == 'object' or grid_reduced[col].dtype.name == 'category':
    grid_reduced[col] = pd.Categorical(grid_reduced[col])
    cat_features.append(col)
train_mask = grid_reduced['d'].astype(int)<=END_TRAIN
valid_mask = train_mask&(grid_reduced['d'].astype(int)>(END_TRAIN-P_HORIZON))
preds_mask = (grid_reduced['d'].astype(int)>(END_TRAIN-100)) & (grid_reduced['d'].astype(int) <= END_TRAIN+P_HORIZON)
train_data = lgb.Dataset(grid_reduced[train_mask][features_columns], 
                         label=grid_reduced[train_mask][TARGET],
                         categorical_feature=cat_features)
valid_data = lgb.Dataset(grid_reduced[valid_mask][features_columns], 
                         label=grid_reduced[valid_mask][TARGET],
                         categorical_feature=cat_features)
Now comes the cell where the error happens:
seed_everything(SEED)
estimator = lgb.train(lgb_params,
                          train_data,
                          valid_sets = [valid_data],
                          categorical_feature=cat_features
Here's the grid_reduced csv, I hope you can download this kind of file, otherwise we can find a solution. Also sorry because this isn't excatly minimal but deleting some thing could have compromised other parts of the code and generated other erros.
grid_reduced.csv
          Thanks! We can look at that at some point
this isn't excatly minimal but deleting some thing could have compromised other parts of the code and generated other errors.
Right, that's why "minimal" is important in getting feedback. By not deleting things further, you're instead asking whoever is helping you to do that work instead.
Someone here will look into this when we have time. Anything you could do to further reduce the problem to something smaller would improve the likelihood that you get a timely and helpful answer.
          Sorry for the trouble and thanks for your help guys. To make a long story short, there were duplicate features/columns that were causing trouble. I don't think this post is useful anymore and maybe it's better to delete it all.
Thanks once again!