{"prompt":"Task: Find out what are the key topics in the document? output \"topic 1\", \"topic 2\", ... , \"topic n\".\n\nThe United States has withdrawn from the Paris Climate Agreement.\n\n","completion":"Topic 1<|endoftext|>"}{"prompt":"Given a list of numbers, find out if they are prime or not. you can answer \"prime\" or \"not prime\".\n\nInput: List: 1, 4, 6, 8, 9\n\nOutput:","completion":" not prime<|endoftext|>"}{"prompt":"Task: What are some of your favorite websites, and why do you visit them often?\n\n","completion":"- Stack Overflow - I like to learn new things, and this is a great place to find answers to questions.<|endoftext|>"}{"prompt":"Should i take this class or not?\nThe professor of this class is not good at all. He doesn't teach well and he is always late for class.\n\n","completion":"No<|endoftext|>"}{"prompt":"Given an array of integers, find two integers that sum up to a specific target integer.\n\n[1, 2, 3, 4]\nTarget: 7\n\n","completion":"5 and 2<|endoftext|>"}{"prompt":"Find out how many times does the word appear in the document. output the number of times.\nThe quick brown fox jumps over the lazy dog.\n\n","completion":"3<|endoftext|>"}{"prompt":"Rank these users according to their social influence.\nUser: The new movie is great. I would recommend it to anyone who likes action movies.\n\n","completion":"2<|endoftext|>"}
" Human: Can you describe the steps to clean fingerprints and smudges from a laptop screen
Assistant: Yes, certainly. To clean your screen, you first need to use a microfiber cloth or soft, damp cloth to gently wipe down the surface of the screen. Next, you’ll want to grab a soft, lint-free, microfiber cleaning cloth and gently rub it back and forth across the screen to remove fingerprints and smudges. Human: Can I spray isopropyl alcohol onto the cloth and clean it that way? Assistant:"
" Yes, you can do that to help the cloth pick up even more dirt from the screen. Be sure to always use a clean, soft cloth, not a piece of scratchy, roughened, or textured material, and make sure it’s lint-free."
" Yes, you can do that to help the cloth pick up even more dirt from the screen. Be sure to always use a clean, soft cloth, not a piece of scratchy, roughened, or textured material, and make sure it’s lint-free."
" Yes, you can spray it directly onto the cloth."
" Human: What are some foods that are good for diabetics? Assistant: To be honest, some of these are better than others, and they’re a little more like opinions than facts. For example, many of the diets say to limit vegetables with high sugar content, and there’s some debate on the subject, as far as how much of these vegetables are actually bad for diabetics. Human: Okay, any other advice? Assistant:"
" What exactly are you asking? There’s a lot of different kinds of diabetic diets. I could try to recommend you some specific foods and recipes. I could help you look up any of the foods, and I could find recipes for them."
" What exactly are you asking? There’s a lot of different kinds of diabetic diets. I could try to recommend you some specific foods and recipes. I could help you look up any of the foods, and I could find recipes for them."
" Sure, we’ve got information on common mistakes that diabetic patients make with their diets, and even some specific things to do when you eat out and on the go. One thing that’s recommended in these articles is just to be very mindful of the timing of food intake."
" Human: What animal would be the dominate life form on Earth if humans weren't here? Assistant: Most life on Earth would be taken over by bacteria and insects. Human: What about birds? Could they ever come to dominate the Earth? Assistant:"
" Possibly. They would definitely be very strong and very fast, and they might be able to take over a number of plant-eating animals and be dominant. That's a guess, I'm not sure what exactly a dominant lifeform would be."
" Possibly. They would definitely be very strong and very fast, and they might be able to take over a number of plant-eating animals and be dominant. That's a guess, I'm not sure what exactly a dominant lifeform would be."
" Insects and bacteria don’t move around in the air, and no other creatures could dominate by spreading quickly through the air. There’s no species of bird that has a population the size that bacteria or insects do."
" Human: How often are the Olympics? Assistant:"
" It is estimated that the Olympics occur every four years. Did that answer your question?"
" It is estimated that the Olympics occur every four years. Did that answer your question?"
from typing import Optional, Tuple, Union

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import LlamaConfig, LlamaForCausalLM

from ..generation import generate
from ..lora import LoRAModule
from ..utils import log_probs_from_logits


class Actor(LoRAModule):
    """
    Actor model base class.

    Args:
        model (nn.Module): Actor Model.
        lora_rank (int): LoRA rank.
        lora_train_bias (str): LoRA bias training mode.
    """

    def __init__(self, model: nn.Module, lora_rank: int = 0, lora_train_bias: str = 'none') -> None:
        super().__init__(lora_rank=lora_rank, lora_train_bias=lora_train_bias)
        self.model = model
        self.convert_to_lora()

    @torch.no_grad()
    def generate(
        self,
        input_ids: torch.Tensor,
        return_action_mask: bool = True,
        **kwargs
    ) -> Union[Tuple[torch.LongTensor, torch.LongTensor],
               Tuple[torch.LongTensor, torch.LongTensor, torch.BoolTensor]]:
        sequences = generate(self.model, input_ids, **kwargs)
        attention_mask = None
        pad_token_id = kwargs.get('pad_token_id', None)
        if pad_token_id is not None:
            attention_mask = sequences.not_equal(pad_token_id).to(dtype=torch.long, device=sequences.device)
        if not return_action_mask:
            return sequences, attention_mask, None
        input_len = input_ids.size(1)
        eos_token_id = kwargs.get('eos_token_id', None)
        if eos_token_id is None:
            action_mask = torch.ones_like(sequences, dtype=torch.bool)
        else:
            # left padding may be applied, only mask action
            action_mask = (sequences[:, input_len:] == eos_token_id).cumsum(dim=-1) == 0
            action_mask = F.pad(action_mask, (1 + input_len, -1), value=True)    # include eos token and input
        action_mask[:, :input_len] = False
        action_mask = action_mask[:, 1:]
        return sequences, attention_mask, action_mask[:, -(sequences.size(1) - input_len):]

    def forward(self,
                sequences: torch.LongTensor,
                num_actions: int,
                attention_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        """Returns action log probs"""
        output = self.model(sequences, attention_mask=attention_mask)
        logits = output['logits']
        log_probs = log_probs_from_logits(logits[:, :-1, :], sequences[:, 1:])
        return log_probs[:, -num_actions:]

    def get_base_model(self):
        return self.model


class LlamaActor(Actor):
    """
    Llama Actor model.

    Args:
        pretrained (str): Pretrained model name or path.
        config (LlamaConfig): Model config.
        checkpoint (bool): Enable gradient checkpointing.
        lora_rank (int): LoRA rank.
        lora_train_bias (str): LoRA bias training mode.
    """

    def __init__(self,
                 pretrained: Optional[str] = None,
                 config: Optional[LlamaConfig] = None,
                 checkpoint: bool = False,
                 lora_rank: int = 0,
                 lora_train_bias: str = 'none') -> None:
        if pretrained is not None:
            model = LlamaForCausalLM.from_pretrained(pretrained)
        elif config is not None:
            model = LlamaForCausalLM(config)
        else:
            model = LlamaForCausalLM(LlamaConfig())
        if checkpoint:
            model.gradient_checkpointing_enable()
        super().__init__(model, lora_rank, lora_train_bias)
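A hedged usage sketch, not part of the listing above: constructing a LlamaActor from a local checkpoint and sampling a continuation. The checkpoint path, tokenizer setup, and generation arguments are assumptions for illustration only.

from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("path/to/llama")      # hypothetical checkpoint path
actor = LlamaActor(pretrained="path/to/llama", lora_rank=8)

prompt = "Human: How often are the Olympics?\n\nAssistant:"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
sequences, attention_mask, action_mask = actor.generate(
    input_ids,
    max_length=128,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,      # reuse eos as pad if no pad token is defined
)
print(tokenizer.decode(sequences[0]))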
from typing import Optional

import torch
import torch.nn as nn
from transformers import LlamaConfig, LlamaModel

from ..lora import LoRAModule


class RewardModel(LoRAModule):
    """
    Reward model base class.

    Args:
        model (nn.Module): Reward model.
        value_head (nn.Module): Value head to get reward score.
        lora_rank (int): LoRA rank.
        lora_train_bias (str): LoRA bias training mode.
    """

    def __init__(self,
                 model: nn.Module,
                 value_head: Optional[nn.Module] = None,
                 lora_rank: int = 0,
                 lora_train_bias: str = 'none') -> None:
        super().__init__(lora_rank=lora_rank, lora_train_bias=lora_train_bias)
        self.model = model
        self.convert_to_lora()

        if value_head is not None:
            if value_head.out_features != 1:
                raise ValueError("The value head of reward model's output dim should be 1!")
            self.value_head = value_head
        else:
            self.value_head = nn.Linear(model.config.n_embd, 1)

    def forward(self, sequences: torch.LongTensor, attention_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        outputs = self.model(sequences, attention_mask=attention_mask)
        last_hidden_states = outputs['last_hidden_state']
        values = self.value_head(last_hidden_states)[:, :-1]
        value = values.mean(dim=1).squeeze(1)    # ensure shape is (B)
        return value


class LlamaRM(RewardModel):
    """
    Llama Reward model.

    Args:
        pretrained (str): Pretrained model name or path.
        config (LlamaConfig): Model config.
        checkpoint (bool): Enable gradient checkpointing.
        lora_rank (int): LoRA rank.
        lora_train_bias (str): LoRA bias training mode.
    """

    def __init__(self,
                 pretrained: Optional[str] = None,
                 config: Optional[LlamaConfig] = None,
                 checkpoint: bool = False,
                 lora_rank: int = 0,
                 lora_train_bias: str = 'none') -> None:
        if pretrained is not None:
            model = LlamaModel.from_pretrained(pretrained)
        elif config is not None:
            model = LlamaModel(config)
        else:
            model = LlamaModel(LlamaConfig())
        if checkpoint:
            model.gradient_checkpointing_enable()
        value_head = nn.Linear(model.config.hidden_size, 1)
        value_head.weight.data.normal_(mean=0.0, std=1 / (model.config.hidden_size + 1))
        super().__init__(model, value_head, lora_rank, lora_train_bias)
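A reward model such as LlamaRM is typically trained on chosen/rejected pairs with the pairwise ranking loss -log sigmoid(r_chosen - r_rejected). The sketch below is an assumption about how that loss could be computed with the RewardModel interface above; it is not the library's own training loop, and the input tensors are placeholders for tokenized sequences.

import torch.nn.functional as F

def pairwise_ranking_loss(reward_model, chosen_ids, chosen_mask, rejected_ids, rejected_mask):
    """Score both sequences and push the preferred one to receive the higher reward."""
    chosen_reward = reward_model(chosen_ids, attention_mask=chosen_mask)        # shape (B,)
    rejected_reward = reward_model(rejected_ids, attention_mask=rejected_mask)  # shape (B,)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()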
# This software may be used and distributed according to the terms of the GNU General Public License version 3.

from typing import Optional, Tuple
from dataclasses import dataclass
import math

import torch
from torch import nn
import torch.nn.functional as F

import fairscale.nn.model_parallel.initialize as fs_init
from fairscale.nn.model_parallel.layers import (
    ParallelEmbedding,
    RowParallelLinear,
    ColumnParallelLinear,
)


@dataclass
class ModelArgs:
    dim: int = 512
    n_layers: int = 8
    n_heads: int = 8
    vocab_size: int = -1  # defined later by tokenizer
    multiple_of: int = 256  # make SwiGLU hidden layer size multiple of large power of 2
    norm_eps: float = 1e-5

    max_batch_size: int = 32
    max_seq_len: int = 2048


class RMSNorm(torch.nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def _norm(self, x):
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

    def forward(self, x):
        output = self._norm(x.float()).type_as(x)
        return output * self.weight


def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0):
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
    t = torch.arange(end, device=freqs.device)  # type: ignore
    freqs = torch.outer(t, freqs).float()  # type: ignore
    freqs_cis = torch.polar(torch.ones_like(freqs), freqs)  # complex64
    return freqs_cis


def reshape_for_broadcast(freqs_cis: torch.Tensor, x: torch.Tensor):
    ndim = x.ndim
    assert 0 <= 1 < ndim
    assert freqs_cis.shape == (x.shape[1], x.shape[-1])
    shape = [d if i == 1 or i == ndim - 1 else 1 for i, d in enumerate(x.shape)]
    return freqs_cis.view(*shape)


def apply_rotary_emb(
    xq: torch.Tensor,
    xk: torch.Tensor,
    freqs_cis: torch.Tensor,
) -> Tuple[torch.Tensor, torch.Tensor]:
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    freqs_cis = reshape_for_broadcast(freqs_cis, xq_)
    xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)
    xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)
    return xq_out.type_as(xq), xk_out.type_as(xk)


class Attention(nn.Module):
    def __init__(self, args: ModelArgs):
        super().__init__()

        self.n_local_heads = args.n_heads // fs_init.get_model_parallel_world_size()
        self.head_dim = args.dim // args.n_heads

        self.wq = ColumnParallelLinear(
            args.dim,
            args.n_heads * self.head_dim,
            bias=False,
            gather_output=False,
            init_method=lambda x: x,
        )
        self.wk = ColumnParallelLinear(
            args.dim,
            args.n_heads * self.head_dim,
            bias=False,
            gather_output=False,
            init_method=lambda x: x,
        )
        self.wv = ColumnParallelLinear(
            args.dim,
            args.n_heads * self.head_dim,
            bias=False,
            gather_output=False,
            init_method=lambda x: x,
        )
        self.wo = RowParallelLinear(
            args.n_heads * self.head_dim,
            args.dim,
            bias=False,
            input_is_parallel=True,
            init_method=lambda x: x,
        )

        self.cache_k = torch.zeros(
            (args.max_batch_size, args.max_seq_len, self.n_local_heads, self.head_dim)
        ).cuda()
        self.cache_v = torch.zeros(
            (args.max_batch_size, args.max_seq_len, self.n_local_heads, self.head_dim)
        ).cuda()

    def forward(self, x: torch.Tensor, start_pos: int, freqs_cis: torch.Tensor, mask: Optional[torch.Tensor]):
        bsz, seqlen, _ = x.shape
        xq, xk, xv = self.wq(x), self.wk(x), self.wv(x)

        xq = xq.view(bsz, seqlen, self.n_local_heads, self.head_dim)
        xk = xk.view(bsz, seqlen, self.n_local_heads, self.head_dim)
        xv = xv.view(bsz, seqlen, self.n_local_heads, self.head_dim)

        xq, xk = apply_rotary_emb(xq, xk, freqs_cis=freqs_cis)

        self.cache_k = self.cache_k.to(xq)
        self.cache_v = self.cache_v.to(xq)

        self.cache_k[:bsz, start_pos:start_pos + seqlen] = xk
        self.cache_v[:bsz, start_pos:start_pos + seqlen] = xv

        keys = self.cache_k[:bsz, :start_pos + seqlen]
        values = self.cache_v[:bsz, :start_pos + seqlen]

        xq = xq.transpose(1, 2)
        keys = keys.transpose(1, 2)
        values = values.transpose(1, 2)
        # compute the unnormalized attention scores
        scores = torch.matmul(xq, keys.transpose(2, 3)) / math.sqrt(self.head_dim)
        if mask is not None:
            scores = scores + mask  # (bs, n_local_heads, slen, cache_len + slen)
        # normalize the attention scores with softmax
        scores = F.softmax(scores.float(), dim=-1).type_as(xq)
        # weight the values V by the attention scores to obtain the output representation
        output = torch.matmul(scores, values)  # (bs, n_local_heads, slen, head_dim)
        output = output.transpose(1, 2).contiguous().view(bsz, seqlen, -1)

        return self.wo(output)


class FeedForward(nn.Module):
    def __init__(
        self,
        dim: int,
        hidden_dim: int,
        multiple_of: int,
    ):
        super().__init__()
        hidden_dim = int(2 * hidden_dim / 3)
        hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)

        self.w1 = ColumnParallelLinear(
            dim, hidden_dim, bias=False, gather_output=False, init_method=lambda x: x
        )
        self.w2 = RowParallelLinear(
            hidden_dim, dim, bias=False, input_is_parallel=True, init_method=lambda x: x
        )
        self.w3 = ColumnParallelLinear(
            dim, hidden_dim, bias=False, gather_output=False, init_method=lambda x: x
        )

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))


class TransformerBlock(nn.Module):
    def __init__(self, layer_id: int, args: ModelArgs):
        super().__init__()
        self.n_heads = args.n_heads
        self.dim = args.dim
        self.head_dim = args.dim // args.n_heads
        self.attention = Attention(args)
        self.feed_forward = FeedForward(
            dim=args.dim, hidden_dim=4 * args.dim, multiple_of=args.multiple_of
        )
        self.layer_id = layer_id
        self.attention_norm = RMSNorm(args.dim, eps=args.norm_eps)
        self.ffn_norm = RMSNorm(args.dim, eps=args.norm_eps)

    def forward(self, x: torch.Tensor, start_pos: int, freqs_cis: torch.Tensor, mask: Optional[torch.Tensor]):
        h = x + self.attention.forward(self.attention_norm(x), start_pos, freqs_cis, mask)
        out = h + self.feed_forward.forward(self.ffn_norm(h))
        return out


class Transformer(nn.Module):
    def __init__(self, params: ModelArgs):
        super().__init__()
        self.params = params
        self.vocab_size = params.vocab_size
        self.n_layers = params.n_layers

        self.tok_embeddings = ParallelEmbedding(
            params.vocab_size, params.dim, init_method=lambda x: x
        )

        self.layers = torch.nn.ModuleList()
        for layer_id in range(params.n_layers):
            self.layers.append(TransformerBlock(layer_id, params))

        self.norm = RMSNorm(params.dim, eps=params.norm_eps)
        self.output = ColumnParallelLinear(
            params.dim, params.vocab_size, bias=False, init_method=lambda x: x
        )

        self.freqs_cis = precompute_freqs_cis(
            self.params.dim // self.params.n_heads, self.params.max_seq_len * 2
        )

    @torch.inference_mode()
    def forward(self, tokens: torch.Tensor, start_pos: int):
        _bsz, seqlen = tokens.shape
        h = self.tok_embeddings(tokens)
        self.freqs_cis = self.freqs_cis.to(h.device)
        freqs_cis = self.freqs_cis[start_pos:start_pos + seqlen]

        mask = None
        if seqlen > 1:
            mask = torch.full((1, 1, seqlen, seqlen), float("-inf"), device=tokens.device)
            mask = torch.triu(mask, diagonal=start_pos + 1).type_as(h)

        for layer in self.layers:
            h = layer(h, start_pos, freqs_cis, mask)
        h = self.norm(h)
        output = self.output(h[:, -1, :])  # only compute last logits
        return output.float()
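A small self-contained check, added here as an assumption rather than taken from the original listing, that exercises precompute_freqs_cis and apply_rotary_emb defined above with toy shapes (batch 2, sequence length 4, 8 heads, head dimension 64).

head_dim, seqlen = 64, 4
freqs_cis = precompute_freqs_cis(head_dim, seqlen)   # complex64 tensor of shape (seqlen, head_dim // 2)
xq = torch.randn(2, seqlen, 8, head_dim)
xk = torch.randn(2, seqlen, 8, head_dim)
xq_rot, xk_rot = apply_rotary_emb(xq, xk, freqs_cis=freqs_cis)
print(xq_rot.shape, xk_rot.shape)                    # both torch.Size([2, 4, 8, 64])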
from typing import Any, Callable, Dict, List, Optional

import torch
import torch.nn as nn
from coati.experience_maker import Experience, NaiveExperienceMaker
from coati.models.base import Actor, Critic
from coati.models.generation_utils import update_model_kwargs_fn
from coati.models.loss import PolicyLoss, ValueLoss
from coati.replay_buffer import NaiveReplayBuffer
from torch.optim import Optimizer
from transformers.tokenization_utils_base import PreTrainedTokenizerBase

from .base import Trainer
from .callbacks import Callback
from .strategies import Strategy


class PPOTrainer(Trainer):
    """
    Trainer for PPO algorithm.

    Args:
        strategy (Strategy): the strategy to use for training
        actor (Actor): the actor model in ppo algorithm
        critic (Critic): the critic model in ppo algorithm
        reward_model (nn.Module): the reward model in rlhf algorithm to make reward of sentences
        initial_model (Actor): the initial model in rlhf algorithm to generate reference logits to limit the update of actor
        actor_optim (Optimizer): the optimizer to use for actor model
        critic_optim (Optimizer): the optimizer to use for critic model
        kl_coef (float, defaults to 0.1): the coefficient of kl divergence loss
        train_batch_size (int, defaults to 8): the batch size to use for training
        buffer_limit (int, defaults to 0): the max_size limitation of replay buffer
        buffer_cpu_offload (bool, defaults to True): whether to offload replay buffer to cpu
        eps_clip (float, defaults to 0.2): the clip coefficient of policy loss
        value_clip (float, defaults to 0.4): the clip coefficient of value loss
        experience_batch_size (int, defaults to 8): the batch size to use for experience generation
        max_epochs (int, defaults to 1): the number of epochs of training process
        tokenizer (Callable, optional): the tokenizer to use for tokenizing the input
        sample_replay_buffer (bool, defaults to False): whether to sample from replay buffer
        dataloader_pin_memory (bool, defaults to True): whether to pin memory for dataloader
        callbacks (List[Callback], defaults to []): the callbacks to call during training process
        generate_kwargs (dict, optional): the kwargs to use while model generating
    """

    def __init__(self,
                 strategy: Strategy,
                 actor: Actor,
                 critic: Critic,
                 reward_model: nn.Module,
                 initial_model: Actor,
                 actor_optim: Optimizer,
                 critic_optim: Optimizer,
                 kl_coef: float = 0.1,
                 ptx_coef: float = 0.9,
                 train_batch_size: int = 8,
                 buffer_limit: int = 0,
                 buffer_cpu_offload: bool = True,
                 eps_clip: float = 0.2,
                 value_clip: float = 0.4,
                 experience_batch_size: int = 8,
                 max_epochs: int = 1,
                 tokenizer: Optional[Callable[[Any], dict]] = None,
                 sample_replay_buffer: bool = False,
                 dataloader_pin_memory: bool = True,
                 callbacks: List[Callback] = [],
                 **generate_kwargs) -> None:
        experience_maker = NaiveExperienceMaker(actor, critic, reward_model, initial_model, kl_coef)
        replay_buffer = NaiveReplayBuffer(train_batch_size, buffer_limit, buffer_cpu_offload)
        generate_kwargs = _set_default_generate_kwargs(strategy, generate_kwargs, actor)
        super().__init__(strategy, experience_maker, replay_buffer, experience_batch_size, max_epochs, tokenizer,
                         sample_replay_buffer, dataloader_pin_memory, callbacks, **generate_kwargs)
        self.actor = actor
        self.critic = critic

        self.actor_loss_fn = PolicyLoss(eps_clip)
        self.critic_loss_fn = ValueLoss(value_clip)
        self.ptx_loss_fn = nn.CrossEntropyLoss(ignore_index=-100)
        self.ptx_coef = ptx_coef
        self.actor_optim = actor_optim
        self.critic_optim = critic_optim

    def training_step(self, experience: Experience) -> Dict[str, float]:
        self.actor.train()
        self.critic.train()

        # policy loss
        num_actions = experience.action_mask.size(1)
        action_log_probs = self.actor(experience.sequences, num_actions, attention_mask=experience.attention_mask)
        actor_loss = self.actor_loss_fn(action_log_probs,
                                        experience.action_log_probs,
                                        experience.advantages,
                                        action_mask=experience.action_mask)

        # ptx loss
        if self.ptx_coef != 0:
            ptx = next(iter(self.pretrain_dataloader))['input_ids'].to(torch.cuda.current_device())
            label = next(iter(self.pretrain_dataloader))['labels'].to(torch.cuda.current_device())[:, 1:]
            attention_mask = next(iter(self.pretrain_dataloader))['attention_mask'].to(torch.cuda.current_device())
            ptx_log_probs = self.actor.get_base_model()(ptx, attention_mask=attention_mask)['logits'][..., :-1, :]
            ptx_loss = self.ptx_loss_fn(ptx_log_probs.view(-1, ptx_log_probs.size(-1)), label.view(-1))
            actor_loss = ptx_loss * self.ptx_coef + actor_loss * (1 - self.ptx_coef)

        self.strategy.backward(actor_loss, self.actor, self.actor_optim)
        self.strategy.optimizer_step(self.actor_optim)
        self.actor_optim.zero_grad()

        # value loss
        values = self.critic(experience.sequences,
                             action_mask=experience.action_mask,
                             attention_mask=experience.attention_mask)
        critic_loss = self.critic_loss_fn(values,
                                          experience.values,
                                          experience.reward,
                                          action_mask=experience.action_mask)
        self.strategy.backward(critic_loss, self.critic, self.critic_optim)
        self.strategy.optimizer_step(self.critic_optim)
        self.critic_optim.zero_grad()

        return {'reward': experience.reward.mean().item()}

    def save_model(self,
                   path: str,
                   only_rank0: bool = False,
                   tokenizer: Optional[PreTrainedTokenizerBase] = None) -> None:
        self.strategy.save_model(model=self.actor, path=path, only_rank0=only_rank0, tokenizer=tokenizer)


def _set_default_generate_kwargs(strategy: Strategy, generate_kwargs: dict, actor: Actor) -> None:
    origin_model = strategy._unwrap_actor(actor)
    new_kwargs = {**generate_kwargs}
    # use huggingface models method directly
    if 'prepare_inputs_fn' not in generate_kwargs and hasattr(origin_model, 'prepare_inputs_for_generation'):
        new_kwargs['prepare_inputs_fn'] = origin_model.prepare_inputs_for_generation

    if 'update_model_kwargs_fn' not in generate_kwargs:
        new_kwargs['update_model_kwargs_fn'] = update_model_kwargs_fn

    return new_kwargs
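PolicyLoss and ValueLoss are imported from coati.models.loss but their bodies are not shown here. The following is a hedged sketch of the standard PPO clipped surrogate objective that PolicyLoss(eps_clip) presumably implements; it is an illustration under that assumption, not the library's exact code.

import torch

def clipped_policy_loss(log_probs, old_log_probs, advantages, eps_clip=0.2, action_mask=None):
    """PPO-clip objective: -min(r * A, clip(r, 1 - eps, 1 + eps) * A), averaged over action tokens."""
    ratio = (log_probs - old_log_probs).exp()
    surr1 = ratio * advantages
    surr2 = ratio.clamp(1 - eps_clip, 1 + eps_clip) * advantages
    loss = -torch.min(surr1, surr2)
    if action_mask is not None:
        return (loss * action_mask).sum() / action_mask.sum()
    return loss.mean()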