Hello guys. It's my first time using LightGBM and I have been stuck on this issue for a couple of days. I'm trying to replicate a solution to a Kaggle competition that uses LightGBM. The solution is from 2020, and I haven't copied it exactly because I was getting several errors and warnings, which may be related. I already tried the cat_feature parameters, but the error persists. I can share more details and the code if that helps.

Thanks a lot in advance!

Thanks for using LightGBM.

We'd be happy to help, but need more information than this to help you without significant guessing. Please try to provide a minimal, reproducible example ( docs link ), code we can run which is as small as possible and which replicates the error you're encountering.

That should also include the information that was asked for when you clicked "new issue" here but which you didn't provide, like:

  • version of LightGBM you're using
  • how you installed it
  • interface you're using (Python? R? the CLI?)
  • Please also include the exact text of the error, so others who are struggling can find this conversation from search engines.
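
One way to gather the version details listed above is a short script like the following. This is only a sketch: `environment_report` is a hypothetical helper name, not part of LightGBM, and it reads installed versions through the standard library's `importlib.metadata` so it runs even when a package is missing.

```python
import platform
from importlib import metadata


def environment_report(packages=("lightgbm", "numpy", "pandas")):
    """Collect the Python version, platform string, and package versions."""
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report missing packages instead of raising.
            report[name] = "not installed"
    return report


# Print one "name: version" line per entry, ready to paste into an issue.
for key, value in environment_report().items():
    print(f"{key}: {value}")
```

The printed lines can be pasted into the issue as plain text, next to the install command and the full traceback.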

    Thanks for the explanation, I attach a zip with two files that I think are good (and short) enough to replicate the problem.

    example.zip

    Please include the code and examples here, in plaintext in a comment, so they can be found by search engines and so anyone reading the discussion can understand what you're asking about. I'm sorry, but I won't open a random zip file.

    If you're not familiar with formatting text on GitHub, see https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax for some help.

    LightGBM version: 4.0.0

    Python version: 3.9.13

    Command for LGBM installation: pip install lightgbm

    All this was coded in a Python Notebook:

    # Basic imports
    import numpy as np
    import pandas as pd
    import os, sys, gc, time, warnings, pickle, psutil, random
    import multiprocessing 
    from multiprocessing import Pool
    warnings.filterwarnings('ignore')
    
    # Seeder
    def seed_everything(seed=0):
        random.seed(seed)
        np.random.seed(seed)
    # Model parameters
    import lightgbm as lgb
    lgb_params = {
        'boosting_type': 'gbdt',
        'objective': 'tweedie',
        'tweedie_variance_power': 1.1,
        'metric': 'rmse',
        'subsample': 0.5,
        'subsample_freq': 1,
        'learning_rate': 0.015,
        'num_leaves': 2**11-1,
        'min_data_in_leaf': 2**12-1,
        'feature_fraction': 0.5,
        'max_bin': 100,
        'n_estimators': 3000,
        'boost_from_average': False,
        'verbose': -1,
    }
    

    Don't pay much attention to all these variables

    # Variables
    STORE_IDS = ['CA_1','CA_2','CA_3','CA_4','TX_1','TX_2','TX_3','WI_1','WI_2','WI_3']
    ver, KKK = 'priv', 0
    VER = 1                          
    SEED = 42                        
    seed_everything(SEED)            
    lgb_params['seed'] = SEED        
    N_CORES = psutil.cpu_count()     
    #LIMITS and const
    TARGET      = 'sales'            
    START_TRAIN = 0                  
    END_TRAIN   = 1941 - 28*KKK      
    P_HORIZON   = 28                 
    USE_AUX     = False             
    remove_features = ['id','state_id','store_id',
                       'date','wm_yr_wk','d',TARGET]
    mean_features   = ['enc_cat_id_mean','enc_cat_id_std',
                       'enc_dept_id_mean','enc_dept_id_std',
                       'enc_item_id_mean','enc_item_id_std'] 
    #SPLITS for lags creation
    SHIFT_DAY  = 28
    N_LAGS     = 15
    LAGS_SPLIT = [col for col in range(SHIFT_DAY,SHIFT_DAY+N_LAGS)]
    ROLS_SPLIT = []
    for i in [1,7,14]:
        for j in [7,14,30,60]:
            ROLS_SPLIT.append([i,j])
    
    #read grid_reduced
    grid_reduced = pd.read_csv('grid_reduced.csv')
    
    grid_reduced = grid_reduced.loc[:,~grid_reduced.columns.duplicated()] #delete duplicated columns
    cat_features = []
    for col in grid_reduced.columns: #proper transformation of categorical features
      if grid_reduced[col].dtype.name == 'object' or grid_reduced[col].dtype.name == 'category':
        grid_reduced[col] = pd.Categorical(grid_reduced[col])
        cat_features.append(col)
    
    train_mask = grid_reduced['d'].astype(int)<=END_TRAIN
    valid_mask = train_mask&(grid_reduced['d'].astype(int)>(END_TRAIN-P_HORIZON))
    preds_mask = (grid_reduced['d'].astype(int)>(END_TRAIN-100)) & (grid_reduced['d'].astype(int) <= END_TRAIN+P_HORIZON)
    # NOTE: features_columns is not defined anywhere in the posted snippet
    train_data = lgb.Dataset(grid_reduced[train_mask][features_columns], 
                             label=grid_reduced[train_mask][TARGET],
                             categorical_feature=cat_features)
    valid_data = lgb.Dataset(grid_reduced[valid_mask][features_columns], 
                             label=grid_reduced[valid_mask][TARGET],
                             categorical_feature=cat_features)
    

    Now comes the cell where the error happens:

    seed_everything(SEED)
    estimator = lgb.train(lgb_params,
                          train_data,
                          valid_sets=[valid_data],
                          categorical_feature=cat_features)
    

    Here's the grid_reduced CSV. I hope you can download this kind of file; otherwise we can find another solution. Also, sorry that this isn't exactly minimal, but deleting some things could have compromised other parts of the code and generated other errors.

    grid_reduced.csv

    Thanks! We can look at that at some point

    this isn't exactly minimal but deleting some things could have compromised other parts of the code and generated other errors.

    Right, that's why "minimal" is important in getting feedback. By not deleting things further, you're asking whoever is helping you to do that work instead.

    Someone here will look into this when we have time. Anything you could do to further reduce the problem to something smaller would improve the likelihood that you get a timely and helpful answer.
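
As a rough illustration of reducing a problem like this, the data can often be cut down to a few rows and columns and pasted as plain text instead of attached as a file. The sketch below assumes pandas; `shrink_for_repro` and the toy frame are hypothetical stand-ins, not code from this thread.

```python
import pandas as pd


def shrink_for_repro(df, n_rows=20, columns=None, seed=0):
    """Return a small, reproducible sample of df limited to the given columns."""
    subset = df if columns is None else df[list(columns)]
    n = min(n_rows, len(subset))
    # A fixed random_state gives the same rows on every run.
    return subset.sample(n=n, random_state=seed).sort_index()


# Toy stand-in for a large training grid (made-up values).
example = pd.DataFrame({
    "d": range(100),
    "store_id": ["CA_1"] * 100,
    "sales": [i % 5 for i in range(100)],
})

small = shrink_for_repro(example, n_rows=5, columns=["d", "sales"])
# CSV text can be pasted straight into a comment inside a code fence.
print(small.to_csv(index=False))
```

If the error still occurs with the reduced frame, that frame and the few lines that trigger the error are usually enough for someone else to reproduce it.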

    Sorry for the trouble and thanks for your help guys. To make a long story short, there were duplicate features/columns that were causing trouble. I don't think this post is useful anymore and maybe it's better to delete it all.

    Thanks once again!
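
For anyone who lands here with a similar problem: the thread does not say which columns were duplicated or how they got there, so the following is only a sketch of one way to check for them before building an `lgb.Dataset`. `find_duplicate_columns` is a hypothetical helper. Note that `pd.read_csv` renames a repeated header to `name.1`, `name.2`, and so on, so a check on column names alone (like `columns.duplicated()`) will not catch duplicates that went through a CSV file.

```python
from io import StringIO

import pandas as pd


def find_duplicate_columns(df):
    """Return (repeated_names, identical_pairs) for a DataFrame.

    repeated_names: column labels that occur more than once.
    identical_pairs: pairs of columns whose contents are identical,
    whatever their labels are.
    """
    repeated_names = df.columns[df.columns.duplicated()].tolist()
    identical_pairs = []
    cols = list(df.columns)
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            # Compare by position so repeated labels are handled too.
            if df.iloc[:, i].equals(df.iloc[:, j]):
                identical_pairs.append((cols[i], cols[j]))
    return repeated_names, identical_pairs


# Made-up CSV with a repeated "sales" header.
csv_text = "item_id,sales,sales\nA,1,1\nB,2,2\n"
df = pd.read_csv(StringIO(csv_text))

names, pairs = find_duplicate_columns(df)
print(list(df.columns))  # the second "sales" has been renamed by read_csv
print(names, pairs)
```

The pairwise comparison is quadratic in the number of columns, which is fine for a quick check on a few hundred features but may be slow on very wide frames.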