Train your own LLM, Part 1: Train the tokenizer

JavaScript Model

Prepare the dataset

  1. Find all the JavaScript files inside a folder and sum their total size.

find . -name "*.js" -print0 | xargs -0 du -sb | awk '{sum += $1} END {print sum}'
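The same total can be computed with a short Python sketch, equivalent to the find/du/awk pipeline above (assuming it is run from the same folder):

```python
from pathlib import Path

# Sum the sizes (in bytes) of every .js file under the current
# directory; same result as the shell pipeline.
total = sum(p.stat().st_size for p in Path(".").rglob("*.js"))
print(total, "bytes")
```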
  2. Create a small tool that converts a file into a JSON string, to make the workflow simpler.
#!/usr/bin/env node
const fs = require("fs");
const { Command } = require("commander");

const program = new Command();

program.command("jsonstr").action(() => {
  const file = process.argv[3];
  console.log(JSON.stringify(fs.readFileSync(file, "utf-8")));
});

program.parse();

a. Install dependencies

npm install commander

b. Create an alias to simplify the command

alias jstr='/home/xxxx/jstr.js jsonstr'
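For reference, JSON.stringify simply escapes the file text into a single JSON string; its Python equivalent, json.dumps, shows the shape of the tool's output:

```python
import json

# The jsonstr command emits the file text as one JSON string,
# with quotes and newlines escaped.
content = 'console.log("hi");\n'
print(json.dumps(content))  # "console.log(\"hi\");\n"
```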
  3. Flatten all the files into a numbered sequence and put them in the output folder.
#!/bin/bash

output_dir=/var/tmp/js
mkdir -p "$output_dir"

# Get all the .js files in the current directory
js_files=( $(find . -type f -name "*.js") )
js_files_length=${#js_files[@]}

# Iterate over the .js files
for (( i = 0; i < js_files_length; i++ )); do
  # Get the file content as a JSON string (jstr already adds the quotes)
  content=$(jstr "${js_files[i]}")

  # Create a JSON file with the file name and content
  echo "{ \"file\": \"${js_files[i]}\", \"data\": $content }" > "${output_dir}/$((i + 1)).json"

  # Print the progress
  printf "\r%d/%d" "$i" "$js_files_length"
done
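To check the flattened dataset, each numbered JSON file can be read back. A minimal sketch, assuming the /var/tmp/js output directory above (it writes one sample record first so it runs standalone):

```python
import json
from pathlib import Path

output_dir = Path("/var/tmp/js")
output_dir.mkdir(parents=True, exist_ok=True)

# Write one record the way the bash script does: file name plus content.
sample = {"file": "./example.js", "data": 'console.log("hi");\n'}
(output_dir / "1.json").write_text(json.dumps(sample))

# Read it back and confirm the content survives the round trip.
record = json.loads((output_dir / "1.json").read_text())
print(record["file"], len(record["data"]), "chars")
```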

Using the spaCy library

# Install spaCy
pip install spacy

# Download the small English pipeline
python3 -m spacy download en_core_web_sm

After that, you can run the following code in the notebook:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text, token.pos_, token.dep_)

Here is the output:

Apple PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.K. PROPN dobj
startup NOUN dep
for ADP prep
$ SYM quantmod
1 NUM compound
billion NUM pobj
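Note that en_core_web_sm is an English pipeline, so its part-of-speech and dependency tags are not meaningful for source code. Still, spaCy's rule-based tokenizer can be tried on the JavaScript data using a blank pipeline; a sketch (no model download needed):

```python
import spacy

# spacy.blank("en") gives only the rule-based tokenizer,
# without the tagger/parser from en_core_web_sm.
nlp = spacy.blank("en")
doc = nlp("const file = process.argv[3];")
print([token.text for token in doc])
```

This shows why code usually needs its own tokenizer: the English rules split on whitespace and punctuation conventions rather than JavaScript syntax.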