Train your own LLM, Part 1: Train the tokenizer

JavaScript Model

Prepare the dataset

  1. Find all the JavaScript files inside a folder and sum their total size.

find . -name "*.js" -print0 | xargs -0 du -sb | awk '{sum += $1} END {print sum}'
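The same total can be computed with a short Python sketch, equivalent to the find/du/awk pipeline above (assuming it is run from the same folder):

```python
from pathlib import Path

# Sum the sizes (in bytes) of every .js file under the current
# directory; same result as the shell pipeline.
total = sum(p.stat().st_size for p in Path(".").rglob("*.js"))
print(total, "bytes")
```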
  2. Create a small tool that converts a file into a JSON string, to make the workflow simpler.
#!/usr/bin/env node
const fs = require("fs");
const { Command } = require("commander");

const program = new Command();

program.command("jsonstr").action(() => {
  const file = process.argv[3];
  console.log(JSON.stringify(fs.readFileSync(file, "utf-8")));
});

program.parse();

a. Install dependencies

npm install commander

b. Create an alias to simplify the command

alias jstr='/home/xxxx/jstr.js jsonstr'
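For reference, JSON.stringify simply escapes the file text into a single JSON string; its Python equivalent, json.dumps, shows the shape of the tool's output:

```python
import json

# The jsonstr command emits the file text as one JSON string,
# with quotes and newlines escaped.
content = 'console.log("hi");\n'
print(json.dumps(content))  # "console.log(\"hi\");\n"
```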
  3. Flatten all the files into a numbered sequence and put them in the output folder.
#!/bin/bash

output_dir=/var/tmp/js
mkdir -p "$output_dir"

# Get all the .js files in the current directory
js_files=( $(find . -type f -name "*.js") )
js_files_length=${#js_files[@]}

# Iterate over the .js files
for (( i = 0; i < js_files_length; i++ )); do
  # Get the file content as a JSON string (jstr already adds the quotes)
  content=$(jstr "${js_files[i]}")

  # Create a JSON file with the file name and content
  echo "{ \"file\": \"${js_files[i]}\", \"data\": $content }" > "${output_dir}/$((i + 1)).json"

  # Print the progress
  printf "\r%d/%d" "$i" "$js_files_length"
done
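To check the flattened dataset, each numbered JSON file can be read back. A minimal sketch, assuming the /var/tmp/js output directory above (it writes one sample record first so it runs standalone):

```python
import json
from pathlib import Path

output_dir = Path("/var/tmp/js")
output_dir.mkdir(parents=True, exist_ok=True)

# Write one record the way the bash script does: file name plus content.
sample = {"file": "./example.js", "data": 'console.log("hi");\n'}
(output_dir / "1.json").write_text(json.dumps(sample))

# Read it back and confirm the content survives the round trip.
record = json.loads((output_dir / "1.json").read_text())
print(record["file"], len(record["data"]), "chars")
```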

Using the spaCy library

# Install spaCy
pip install spacy

# Download the small English pipeline
python3 -m spacy download en_core_web_sm

After that, you can run the following code in the notebook:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text, token.pos_, token.dep_)

Here is the output:

Apple PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.K. PROPN dobj
startup NOUN dep
for ADP prep
$ SYM quantmod
1 NUM compound
billion NUM pobj
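Note that en_core_web_sm is an English pipeline, so its part-of-speech and dependency tags are not meaningful for source code. Still, spaCy's rule-based tokenizer can be tried on the JavaScript data using a blank pipeline; a sketch (no model download needed):

```python
import spacy

# spacy.blank("en") gives only the rule-based tokenizer,
# without the tagger/parser from en_core_web_sm.
nlp = spacy.blank("en")
doc = nlp("const file = process.argv[3];")
print([token.text for token in doc])
```

This shows why code usually needs its own tokenizer: the English rules split on whitespace and punctuation conventions rather than JavaScript syntax.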