Hello everyone. It's my first time using LightGBM and I have been stuck on this issue for a couple of days. I'm trying to replicate a solution from a 2020 Kaggle competition that uses LightGBM. I haven't copied it exactly, because the original produced several errors and warnings, so that may be related. I already tried the categorical_feature parameter but the error persists. I can share more details and the code if that helps.
We'd be happy to help, but we need more information than this to help you without significant guessing. Please try to provide a minimal, reproducible example (docs link): code we can run which is as small as possible and which replicates the error you're encountering.
That should also include the information that was asked for when you clicked "new issue" here but which you didn't provide, like:
Please also include the exact text of the error, so others who are struggling can find this conversation from search engines.
Thanks for the explanation. I've attached a zip with two files that I think are good (and short) enough to replicate the problem.
Please include the code and examples here, in plaintext in a comment, so they can be found by search engines and so anyone reading the discussion can understand what you're asking about. I'm sorry, but I won't open a random zip file.
# Basic imports
import numpy as np
import pandas as pd
import os, sys, gc, time, warnings, pickle, psutil, random
import multiprocessing
from multiprocessing import Pool
warnings.filterwarnings('ignore')

# Seeder
def seed_everything(seed=0):
    random.seed(seed)
    np.random.seed(seed)

# Model parameters
import lightgbm as lgb
lgb_params = {
    'boosting_type': 'gbdt',
    'objective': 'tweedie',
    'tweedie_variance_power': 1.1,
    'metric': 'rmse',
    'subsample': 0.5,
    'subsample_freq': 1,
    'learning_rate': 0.015,
    'num_leaves': 2**11 - 1,
    'min_data_in_leaf': 2**12 - 1,
    'feature_fraction': 0.5,
    'max_bin': 100,
    'n_estimators': 3000,
    'boost_from_average': False,
    'verbose': -1,
}
Don't pay much attention to all these variables:
# Variables
STORE_IDS = ['CA_1','CA_2','CA_3','CA_4','TX_1','TX_2','TX_3','WI_1','WI_2','WI_3']
ver, KKK = 'priv', 0
VER = 1
SEED = 42
seed_everything(SEED)
lgb_params['seed'] = SEED
N_CORES = psutil.cpu_count()
#LIMITS and const
TARGET = 'sales'
START_TRAIN = 0
END_TRAIN = 1941 - 28*KKK
P_HORIZON = 28
USE_AUX = False
remove_features = ['id','state_id','store_id',
'date','wm_yr_wk','d',TARGET]
mean_features = ['enc_cat_id_mean','enc_cat_id_std',
'enc_dept_id_mean','enc_dept_id_std',
'enc_item_id_mean','enc_item_id_std']
#SPLITS for lags creation
SHIFT_DAY = 28
N_LAGS = 15
LAGS_SPLIT = [col for col in range(SHIFT_DAY,SHIFT_DAY+N_LAGS)]
ROLS_SPLIT = []
for i in [1, 7, 14]:
    for j in [7, 14, 30, 60]:
        ROLS_SPLIT.append([i, j])
# Read grid_reduced
grid_reduced = pd.read_csv('grid_reduced.csv')
grid_reduced = grid_reduced.loc[:, ~grid_reduced.columns.duplicated()]  # delete duplicated columns

cat_features = []
for col in grid_reduced.columns:  # proper transformation of categorical features
    if grid_reduced[col].dtype.name in ('object', 'category'):
        grid_reduced[col] = pd.Categorical(grid_reduced[col])
        cat_features.append(col)

# Feature columns (definition assumed: everything except remove_features)
features_columns = [col for col in grid_reduced.columns if col not in remove_features]

train_mask = grid_reduced['d'].astype(int) <= END_TRAIN
valid_mask = train_mask & (grid_reduced['d'].astype(int) > (END_TRAIN - P_HORIZON))
preds_mask = (grid_reduced['d'].astype(int) > (END_TRAIN - 100)) & (grid_reduced['d'].astype(int) <= END_TRAIN + P_HORIZON)

train_data = lgb.Dataset(grid_reduced[train_mask][features_columns],
                         label=grid_reduced[train_mask][TARGET],
                         categorical_feature=cat_features)
valid_data = lgb.Dataset(grid_reduced[valid_mask][features_columns],
                         label=grid_reduced[valid_mask][TARGET],
                         categorical_feature=cat_features)
Now comes the cell where the error happens:
seed_everything(SEED)
estimator = lgb.train(lgb_params,
                      train_data,
                      valid_sets=[valid_data],
                      categorical_feature=cat_features)
Here's the grid_reduced csv. I hope you can download this kind of file; otherwise we can find a solution. Also, sorry this isn't exactly minimal, but deleting some things could have compromised other parts of the code and generated other errors.
grid_reduced.csv
Thanks! We can look at that at some point
this isn't exactly minimal but deleting some things could have compromised other parts of the code and generated other errors.
Right, that's why "minimal" is important in getting feedback. By not deleting things further, you're instead asking whoever is helping you to do that work instead.
Someone here will look into this when we have time. Anything you could do to further reduce the problem to something smaller would improve the likelihood that you get a timely and helpful answer.
Sorry for the trouble and thanks for your help, everyone. To make a long story short, there were duplicate features/columns that were causing trouble. I don't think this post is useful anymore, so maybe it's better to delete it.
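For anyone finding this thread from a search engine, the root cause can be demonstrated with a small pandas sketch (the column names here are made up, not from the actual dataset). Duplicated column names sneak in easily after merges, and they break consumers that expect one column per name:

```python
import pandas as pd

# A frame with a duplicated column name, as can happen after a careless merge
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'a'])

# Selecting a duplicated name returns a DataFrame, not a Series,
# which confuses downstream consumers such as lgb.Dataset
print(type(df['a']))  # <class 'pandas.core.frame.DataFrame'>

# List the offending names
print(df.columns[df.columns.duplicated()].tolist())  # ['a']

# Keep only the first occurrence of each column name
df = df.loc[:, ~df.columns.duplicated()]
print(list(df.columns))  # ['a', 'b']
```

This is the same `~df.columns.duplicated()` idiom used earlier in the thread; the catch was making sure it ran before every place the frame was rebuilt.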
Thanks once again!