Jan 12, 2024
GPT-2 is a large language model, a generative AI created and released by OpenAI. GPT-2 brought a major shift in NLP (Natural Language Processing): it can generate human-like text, translate, summarize, and more.
The released GPT-2 (Generative Pre-trained Transformer 2) is a pre-trained model trained on 40GB of text (a very large corpus) drawn from 8 million web pages.
However, that training data is almost entirely English. In this article, we will try using Shan-language data to fine-tune GPT-2 so that it can generate Shan text.
The primary goal of GPT-2 is to generate text conditioned on the prompt given as input.
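To make that concrete, here is a minimal sketch (my addition, using the stock English gpt2 checkpoint rather than the Shan model built later in this article) of prompting GPT-2 through the transformers pipeline API:

from transformers import pipeline

# stock English gpt2: give it a prompt, it continues the text
generator = pipeline("text-generation", model="gpt2")
result = generator("The quick brown fox", max_new_tokens=20)
print(result[0]["generated_text"])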
To make a computer or AI understand the context of a text and continue writing it coherently, many techniques were used in the past. Further reading: how does an AI/computer understand human language?
Those older probabilistic/statistical models had a problem: as a passage gets longer and longer, they tend to forget earlier context, and remembering a long context takes a lot of resources.
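As a toy illustration (my own sketch, not part of the original post): a bigram model conditions on only the single previous word, so everything earlier in the sentence is invisible to it:

from collections import Counter, defaultdict

# toy bigram model: the next word depends ONLY on the previous word,
# so longer-range context is simply forgotten
corpus = "the cat sat on the mat the cat ate the fish".split()
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

# after "the", the model sees cat/mat/fish with no way to use earlier words
print(bigrams["the"].most_common())  # [('cat', 2), ('mat', 1), ('fish', 1)]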
The Transformer technique was invented to solve exactly those problems. Further reading: Attention Is All You Need
GPT-2 is also transformer-based.
GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than 10X the amount of data.
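That "predict the next word" objective can be probed directly. A small sketch (my addition) asking the stock gpt2 checkpoint for its most likely next token, given the previous words:

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits        # shape: (1, seq_len, vocab_size)

# the distribution at the last position is the model's next-word prediction
next_token_id = logits[0, -1].argmax().item()
print(tokenizer.decode(next_token_id))     # e.g. " Paris"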
Large language models like GPT-2 and GPT-3 need an enormous amount of computer/GPU/TPU power before they can be trained into a working AI model.
GPT-2 took about 100,000 GPU-hours to train, meaning that on a single GPU the training would take roughly 11 years.
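A quick back-of-the-envelope check of that figure:

# 100,000 GPU-hours spread over a single GPU, expressed in years
gpu_hours = 100_000
print(gpu_hours / (24 * 365))  # ~11.4 years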
Training GPT-2 cost about US$40,000, and for GPT-3 that rose to about US$4.6 million.
So unless you are a big company, training one from scratch is simply out of reach.
These days, however, many pre-trained LLMs are available that we can fine-tune by feeding in our own text data and then use as new models. So what we can do is gather as much text data as possible and fine-tune, and that works quite well.
Fine-tuning these LLMs still takes a lot of processing power. On a CPU alone it takes a long time: a job that needs about 10 hours on a CPU can drop to 30 minutes to 1 hour on a GPU.
I don't have a GPU myself, so I will use Google Colab. The free tier doesn't have enough VRAM (GPU RAM), so I need Colab Pro (NVIDIA A100).
Even after buying Pro, I got to use the A100 no more than three times lol
We will mount Google Drive to store the data, because once Colab kills the process, everything on the instance is gone.
We also log in to Hugging Face so the model can be stored on the Hugging Face Hub.
from google.colab import drive

drive.mount('/content/drive')
from huggingface_hub import notebook_login

notebook_login()
Install dependencies
# requirements GPU
!pip install accelerate -U
!pip install transformers[torch]
!pip install datasets
Optional: only needed if you train on a TPU
# requirement TPU
!pip install google-api-python-client>=1.12.5
!pip install cloud-tpu-client==0.10 https://storage.googleapis.com/tpu-pytorch/wheels/colab/torch_xla-2.0-cp310-cp310-linux_x86_64.
!pip install --upgrade torch torch-xla
!pip install --upgrade transformers
Check the available CUDA GPU
!nvidia-smi
How capable the GPT-2 model becomes depends on how much data it sees. The pre-trained model used 40GB of text data (8 million web pages), so for it to recognize and generate Shan text, we also need to feed it as much data as we can possibly find.
Since we are focusing on the generative side, the text data can be any kind of text on any topic, but at a minimum the spelling should be checked and corrected to follow standard orthography.
For that reason, I chose data from websites that I trust to have the fewest spelling errors, such as these sites (shannews.org, among others).
If you know of other websites that publish in Shan besides these, please send them my way.
All of this data is published through APIs. After pulling it down into .csv files, the next step is data cleaning: removing everything that is not actual text, such as "\n", "\t", and HTML tags.
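For illustration only, a hypothetical sketch of what the fetch step might look like; the endpoint shape assumes a WordPress-style REST API and an example domain, not the exact sites used here:

import pandas as pd
import requests

# hypothetical endpoint: many news sites expose a WordPress REST API
BASE = "https://example-shan-site.org/wp-json/wp/v2/posts"

rows = []
for page in range(1, 4):  # fetch the first few pages only
    resp = requests.get(BASE, params={"per_page": 100, "page": page}, timeout=30)
    if resp.status_code != 200:
        break
    for post in resp.json():
        rows.append({
            "title": post["title"]["rendered"],
            "content": post["content"]["rendered"],  # raw HTML, cleaned below
        })

pd.DataFrame(rows).to_csv("shan_data.csv", index=False)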
Clean the .csv data
import pandas as pd
from bs4 import BeautifulSoup
import html

# read data file
file = "./shan_data.csv"
df = pd.read_csv(file)

# fill NaN content field with title
df.loc[df['content'].isna(), 'content'] = df.loc[df['content'].isna(), 'title']

# remove all html tags
def remove_html_tags(contents):
    html_clean = html.unescape(contents)
    soup = BeautifulSoup(html_clean, "html.parser")
    return soup.get_text().replace("\n", " ")

# take the content column after the NaN fill, so no NaN reaches the cleaner
content = df['content'].apply(remove_html_tags)
df['content'] = content

# remove empty space; fall back to the title when content ends up empty
df['content'] = df['content'].str.strip()
df['content'] = df.apply(lambda row: row['title'] if len(row['content']) == 0 else row['content'], axis=1)

# save to csv
df.to_csv(f"cleaned_data/{file}_cleaned.csv", index=False)
print("Finished.")
After cleaning the data, split it into two parts: 1. a training set and 2. a validation set for checking how well the model learns.
import pandas as pd
from sklearn.model_selection import train_test_split

# load the articles (point this at the cleaned csv from the previous step)
df = pd.read_csv("./raw_data/shannews.org.csv", encoding='utf-8')
contents = df['content']

def save_combined_text_to_file(data, output_file):
    # join articles with the GPT-2 end-of-text marker between them
    combined_text = ""
    for content in data:
        combined_text += content + "<|endoftext|>" + "\n"
    with open(output_file, "w", encoding="utf-8") as output:
        output.write(combined_text)
    print("Combined text saved to: ", output_file)

# 80/20 train/validation split
train_text, valid_text = train_test_split(contents, test_size=0.2, random_state=42)

train_file = "./txt_data/shannews/train_data.txt"
valid_file = "./txt_data/shannews/valid_data.txt"

save_combined_text_to_file(train_text, train_file)
save_combined_text_to_file(valid_text, valid_file)

print("Training data saved to: ", train_file)
print("Validation data saved to: ", valid_file)
Tokenization is a crucial step in every NLP task: it splits a run of text into meaningful "tokens". There are several kinds of tokenizers, such as dictionary-based tokenization (looking words up in a dictionary -> ShanNLP), Byte-Pair Encoding (BPE), WordPiece, Unigram, and SentencePiece.
GPT-2 uses BPE tokenization.
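A quick check (my addition) of why the tokenizer needs extending at all: the stock gpt2 BPE vocabulary was learned from English web text, so Shan script falls back to byte-level fragments and one short word explodes into many tokens:

from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
ids = tok("မႂ်ႇသုင်")["input_ids"]    # a short Shan greeting
print(len(ids))                        # many tokens for a single word
print(tok.convert_ids_to_tokens(ids))  # byte-level fragments, not Shan units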
To get a tokenizer for our data that works with GPT-2, we will train a new BPE tokenizer and extend the stock GPT-2 tokenizer with its vocabulary, as in the code below.
from tokenizers import (decoders, models, normalizers, pre_tokenizers,
                        processors, trainers, Tokenizer)
from transformers import GPT2Tokenizer, GPT2TokenizerFast, GPT2Model, GPT2LMHeadModel
from transformers import TextDataset, DataCollatorForLanguageModeling, Trainer, TrainingArguments

# train a new byte-level BPE tokenizer on the Shan corpus
new_tokenizer = Tokenizer(models.BPE())
new_tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
trainer = trainers.BpeTrainer(vocab_size=25000, special_tokens=["<|endoftext|>"])

train_file = '/content/drive/MyDrive/gpt2labs/shannews/shannews_datasets.txt'
new_tokenizer.train([train_file], trainer=trainer)
new_tokenizer.post_processor = processors.ByteLevel(trim_offsets=False)
new_tokenizer.decoder = decoders.ByteLevel()

new_tokenizer = GPT2TokenizerFast(tokenizer_object=new_tokenizer)
new_tokenizer.save_pretrained("shannews_bpe_tokenizer")
new_tokenizer

# save tokenizer to huggingface hub
new_tokenizer.push_to_hub("shannews_bpe_tokenizer")

# gpt2 tokenizer
gpt2_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
print(len(gpt2_tokenizer.get_vocab()))
gpt2_tokenizer

# extended tokenizer: add the Shan vocabulary to the stock gpt2 tokenizer
vocab_tokens = list(new_tokenizer.get_vocab())
decoded_tokens = [new_tokenizer.decoder.decode([token]) for token in vocab_tokens]
print(len(vocab_tokens), len(decoded_tokens))

gpt2_tokenizer.add_tokens(decoded_tokens)
gpt2_tokenizer.save_pretrained("shannews_extened_tokenizer_gpt2")
print(len(gpt2_tokenizer.get_vocab()))
gpt2_tokenizer

# save extended tokenizer to huggingface hub
gpt2_tokenizer.push_to_hub("shannews_bpe_extened_tokenizer")
Now that we have a tokenizer, we prepare the train and validation data: tokenize the text with the tokenizer and group it into fixed-size blocks, called chunks, to feed to the machine.
import os
import torch
import time
from datasets import Dataset, DatasetDict
from tokenizers import (decoders, models, normalizers, pre_tokenizers,
                        processors, trainers, Tokenizer)
from transformers import GPT2Tokenizer, GPT2TokenizerFast, GPT2Model, GPT2LMHeadModel
from transformers import TextDataset, DataCollatorForLanguageModeling, Trainer, TrainingArguments

# load the tokenizer from huggingface
tokenizer = GPT2TokenizerFast.from_pretrained("NorHsangPha/shannews_bpe_extened_tokenizer")
tokenizer.pad_token = tokenizer.eos_token
print(tokenizer.vocab_size)
print(len(tokenizer))
tokenizer.save_pretrained("/tokenizer/shannews_bpe_extened_tokenizer")

# fine-tuning
train_file = "/content/drive/MyDrive/gpt2labs/shannews/train_data.txt"
valid_file = "/content/drive/MyDrive/gpt2labs/shannews/valid_data.txt"

with open(train_file, 'r', encoding='utf-8') as f:
    train_data = f.readlines()
with open(valid_file, 'r', encoding='utf-8') as f:
    valid_data = f.readlines()

print(len(train_data), len(valid_data))

# dataset object
train_dataset = Dataset.from_dict({"text": train_data})
valid_dataset = Dataset.from_dict({"text": valid_data})

def preprocess_function(examples):
    out = tokenizer(examples["text"])
    return out

# apply tokenization to dataset
train_dataset = train_dataset.map(preprocess_function, batched=True, num_proc=4,
                                  remove_columns=train_dataset.column_names)
valid_dataset = valid_dataset.map(preprocess_function, batched=True, num_proc=4,
                                  remove_columns=valid_dataset.column_names)

tokenized_datasets = DatasetDict({"train": train_dataset, "valid": valid_dataset})

# group tokenized datasets to blocks
block_size = 128

def group_texts(examples):
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

# group tokenized train dataset
lm_train_dataset = tokenized_datasets['train'].map(group_texts, batched=True, num_proc=4)
# group tokenized valid dataset
lm_valid_dataset = tokenized_datasets['valid'].map(group_texts, batched=True, num_proc=4)

lm_dataset = DatasetDict({"train": lm_train_dataset, "valid": lm_valid_dataset})
lm_dataset

# Data collator for language modeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

out = data_collator([lm_dataset['train'][i] for i in range(5)])
for key in out:
    print(f"{key} shape: {out[key].shape}")
In this section we load the GPT-2 model for fine-tuning and apply a couple of small optimizations to speed up training a bit (if you have plenty of GPU power, you can skip this part).
from transformers import GPT2LMHeadModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the GPT2 model
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)

# Resize the model's embeddings: round the new vocab size up to a
# multiple of 64, which tends to be friendlier to GPU kernels
def find_multiple(n: int, k: int) -> int:
    if n % k == 0:
        return n
    return n + k - (n % k)

new_embeddings_size = find_multiple(len(tokenizer), 64)
model.resize_token_embeddings(new_embeddings_size)

# Optionally freeze everything except the token embeddings to cut
# memory and compute (disabled by default)
freeze_layers = False
if freeze_layers:
    for name, param in model.named_parameters():
        if 'transformer.wte' in name:
            param.requires_grad = True
        else:
            param.requires_grad = False
Finally, we fine-tune the GPT-2 model with the transformers Trainer.
These parameters are set up for training on a Google Colab Pro A100 GPU with 40GB of RAM. If you have more GPU power and more GPU time available, you can change parameters such as raising batch_size (32, 64, 128); the model may learn better, but it will use more GPU power and take longer.
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="/content/shannews_gpt2/",
    overwrite_output_dir=True,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    evaluation_strategy="steps",
    eval_steps=500,
    logging_steps=500,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    weight_decay=0.1,
    warmup_steps=500,
    lr_scheduler_type="cosine",
    learning_rate=5e-4,
    save_steps=500,
    push_to_hub=True,
    save_total_limit=2,
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=args,
    data_collator=data_collator,
    train_dataset=lm_train_dataset,
    eval_dataset=lm_valid_dataset,
)

# Start training
trainer.train()

# Save the fine-tuned model
trainer.save_model("shannews_gpt2")
After training the model, let's try generating some text.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

torch_device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("NorHsangPha/shannews_bpe_extened_tokenizer")
model = AutoModelForCausalLM.from_pretrained(
    "NorHsangPha/shannews_gpt2",
    pad_token_id=tokenizer.eos_token_id,
).to(torch_device)

text = "တပ်ႉသိုၵ်းၸိုင်ႈတႆး"
# encode the prompt into model inputs
model_inputs = tokenizer(text, return_tensors="pt").to(torch_device)

# set top_k = 50 and set top_p = 0.95 and num_return_sequences = 3
sample_outputs = model.generate(
    **model_inputs,
    max_new_tokens=40,
    no_repeat_ngram_size=2,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    num_return_sequences=3,
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

# Output:
# ----------------------------------------------------------------------------------------------------
# 0: တပ်ႉသိုၵ်းၸိုင်ႈတႆး (ဝၼ်းပိူင်မၢႆမီႈမီး တူၵ်ႇတႅၵ်ႈ (မႄႈၸႄႈဝဵင်းၵျွၵ်ႉမႄးၼၼ်ႉ တေ
# 1: တပ်ႉသိုၵ်းၸိုင်ႈတႆး တပ်ႉၵွင် ၸိုဝ်ႈ ၶၢဝ်းယၢမ်း TNLA ဢွၵ်ႇဝႆႉၼင်ႇၼႆ။ ၵူၼ်းဝၢၼ်ႈၵူတ
# 2: တပ်ႉသိုၵ်းၸိုင်ႈတႆး RCSS/SSA လုၵ်ႈၸၢႆး ၵေးသီးလူင် ၸဵင်ႇတူၼ်ႈတၼ်းသႂ်ႇၶွင်ႊသီႊသႅၼ်ႊပ
After going through the training and testing above, we can see that although the model generates Shan text word by word, the sentences it produces do not yet hang together meaningfully. This can come from several factors.
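One way to put a number on that quality (my addition, not part of the original run) is validation perplexity, derived from the evaluation loss of the trainer object built above:

import math

# perplexity = exp(average cross-entropy loss); lower is better
eval_results = trainer.evaluate()
print(f"eval loss: {eval_results['eval_loss']:.3f}")
print(f"perplexity: {math.exp(eval_results['eval_loss']):.1f}")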
Looking ahead, making AI models such as LLMs and generative AI genuinely usable for the Shan language will take collaboration and support from many sides, such as more Shan text data and funding for training.