Train your own LLM, Part 1: Train the tokenizer

JavaScript Model

Prepare the dataset

  1. Find all the JavaScript files inside a folder and sum up their total size.
find . -name "*.js" -print0 | xargs -0 du -sb | awk '{sum +=$1} END {print sum}'
  2. Create a small tool that converts a file into a JSON string, to make the workflow simpler.
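#!/usr/bin/env node
// Print the content of a file as a single JSON string.
const fs = require("fs");
const { Command } = require("commander");

const program = new Command();

program
  .command("jsonstr")
  // Declare the file argument so commander passes it to the action handler.
  .argument("<file>", "file to convert")
  .action((file) => {
    console.log(JSON.stringify(fs.readFileSync(file, "utf-8")));
  });

program.parse();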

a. Install dependencies

npm install commander

b. Make the script executable and create an alias to shorten the command

chmod +x /home/xxxx/jstr.js
alias jstr='/home/xxxx/jstr.js jsonstr'
  3. Flatten all the files into a numbered sequence of JSON files and put them in the output folder.
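#!/bin/bash

output_dir=/var/tmp/js
mkdir -p "$output_dir"

# Aliases are not expanded in non-interactive scripts, so wrap the tool in a function
jstr() { /home/xxxx/jstr.js jsonstr "$1"; }

# Get all the .js files under the current directory (safe for names with spaces)
mapfile -d '' js_files < <(find . -type f -name "*.js" -print0)
js_files_length=${#js_files[@]}

# Iterate over the .js files
for (( i = 0; i < js_files_length; i++ )); do
  # Get the file content as a JSON string (jstr already adds the quotes and escapes)
  content=$(jstr "${js_files[i]}")

  # Create a JSON file with the file name and content.
  # $content is not quoted again; the file name is assumed to contain no quotes or backslashes.
  printf '{ "file": "%s", "data": %s }\n' "${js_files[i]}" "$content" > "${output_dir}/$((i + 1)).json"

  # Print the progress
  printf '\r%d/%d' "$((i + 1))" "$js_files_length"
done
echo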
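
Building the JSON by hand in the shell is fragile, because quotes, backslashes, and newlines in the file name or content all have to be escaped correctly. As a minimal sketch of the same record format, the whole object can be serialized in one step. The helper name `toRecord` and the `SourceRecord` type are hypothetical and not part of the tooling above:

```typescript
// Shape of one numbered JSON file: the source path plus its raw content.
interface SourceRecord {
  file: string;
  data: string;
}

// Hypothetical helper: serialize the whole record at once so that quotes,
// backslashes, and newlines in both fields are escaped by JSON.stringify.
function toRecord(file: string, content: string): string {
  const record: SourceRecord = { file: file, data: content };
  return JSON.stringify(record);
}

// Example: content that contains quotes and a newline survives a round trip.
const recordText = toRecord("./src/app.js", 'console.log("hi");\n');
const recordBack: SourceRecord = JSON.parse(recordText);
console.log(recordBack.file); // ./src/app.js
```

Serializing the object as a whole means the file name gets the same escaping as the content, which the hand-built string does not provide.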

Using the spaCy library

# Install spaCy
pip install spacy

# Download the small English model
python3 -m spacy download en_core_web_sm

After that, you can run the following code in the notebook:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    # Print the token text, its part-of-speech tag, and its dependency label
    print(token.text, token.pos_, token.dep_)

Here is the output

Apple PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.K. PROPN dobj
startup NOUN dep
for ADP prep
$ SYM quantmod
1 NUM compound
billion NUM pobj

Tips for using bash scripts

Add text at the beginning and end of files. Example:

  1. A CSS or SCSS file imported from a third party puts all of its classes at the root level, even if the import statement sits inside a class scope. That causes lots of issues. The simplest workaround is to wrap every CSS file in the class scope by adding it at the beginning and the end of the file.
  • Add the class at the beginning of each file (GNU sed):
find . -name "*.css" -print0 | xargs -0 -I {} sed -i '1s/^/.myclass{\n/' {}
  • Append the closing brace at the end of each file:
find . -name "*.css" -print0 | xargs -0 -I {} sh -c 'printf "\n}\n" >> "$1"' _ {}
  • Check which network ports are listening:
netstat -tulpn

Image Processing with Halide [WIP]

What is Halide?

Halide is a programming language that makes it easier to write high-performance image and array processing code on modern machines.

When I first learned about Halide, my reaction was: wait a minute, is this thing real? After watching a few YouTube videos, I must say I wish I had heard of it 3 years earlier. It is never too late to learn something new, and there is always a new way to solve an existing problem. The best way to learn a new framework is to use it to solve an existing problem.

Problem to Solve

Vibrance: make an image more vivid while avoiding oversaturation.
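
One way to express this idea is to boost saturation in proportion to how unsaturated each pixel already is, so muted colors move the most and vivid ones barely change. The sketch below is plain TypeScript rather than Halide, and its formula and the names `vibrance` and `amount` are assumptions for illustration, not the implementation this post is working toward:

```typescript
// RGB color with each channel in the range 0..1.
type Rgb = [number, number, number];

function clamp01(value: number): number {
  return Math.min(1, Math.max(0, value));
}

// Assumed formulation: push each channel away from the pixel's gray level,
// scaled by how unsaturated the pixel is. `amount` = 0 leaves the pixel
// unchanged; larger values boost muted colors more than vivid ones.
function vibrance(pixel: Rgb, amount: number): Rgb {
  const [r, g, b] = pixel;
  const max = Math.max(r, g, b);
  const min = Math.min(r, g, b);
  const saturation = max === 0 ? 0 : (max - min) / max; // HSV-style, 0..1
  const gray = (r + g + b) / 3;
  const boost = 1 + amount * (1 - saturation);
  return [
    clamp01(gray + (r - gray) * boost),
    clamp01(gray + (g - gray) * boost),
    clamp01(gray + (b - gray) * boost),
  ];
}

const mutedPixel = vibrance([0.6, 0.5, 0.5], 0.5); // channel spread grows
const vividPixel = vibrance([1, 0, 0], 0.5);       // already saturated, unchanged
```

Because the boost factor shrinks to 1 as saturation approaches 1, fully saturated colors are left alone, which is the "avoid oversaturation" half of the problem statement.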

WebGL Filter Playground

As a software engineer

Being a software engineer, what can I do to help my family?

Problem

Family sharing: as computer science continues to revolutionize our lives, more and more of our time is occupied by screens. It has also changed how families communicate. We are all connected to each other no matter where we are, and that connection should not be blocked by politics or distance.

I want to share the latest photos of my kids with their grandparents. Usually that can be done with Google Photos, but Google Photos cannot be accessed from China. I am a developer, so I should be able to solve this problem. I built a family sharing service on AWS with a serverless architecture to share photos with my parents. It costs only 4 dollars per month for everything, including SSL and the domain. It was the best thing. I did not choose a third-party service for security reasons.

Design

The main idea is to copy images from Google Photos, store them in AWS, and view them from there. Each album is stored as a JSON file, so there is no need for a database (cost-effective, right?).
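
To make the "album as a JSON file" idea concrete, here is a minimal sketch. The field names (`title`, `photos`, `key`, `takenAt`) and the helper functions are hypothetical; the actual layout used by the service is not described in this post:

```typescript
// Hypothetical album layout; the field names are assumptions for illustration.
interface Photo {
  key: string;     // object key of the copied image in storage
  takenAt: string; // ISO 8601 timestamp
}

interface Album {
  title: string;
  photos: Photo[];
}

// Adding a photo returns a new album object. The caller would then upload
// the serialized JSON file, replacing the previous version.
function addPhoto(album: Album, photo: Photo): Album {
  return { title: album.title, photos: album.photos.concat([photo]) };
}

function serializeAlbum(album: Album): string {
  return JSON.stringify(album, null, 2);
}

const emptyAlbum: Album = { title: "Summer", photos: [] };
const updatedAlbum = addPhoto(emptyAlbum, {
  key: "summer/001.jpg",
  takenAt: "2020-07-01T10:00:00Z",
});
console.log(serializeAlbum(updatedAlbum));
```

With one small file per album, reading an album is a single file fetch and writing is a single upload, which is what makes skipping the database workable at family scale.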

PlantUML

When I worked for Microsoft as a contractor, I tended to choose tools from Microsoft. I used Visio as my primary tool to draw sequence diagrams and workflow diagrams. Since 2014 I have used a Mac as my primary computer, so I needed to find a good tool that I could use in the long run. When I worked at Autodesk, I used Draw.io for drawings, and it worked OK.

One day I decided to find a tool for my personal use. I found PlantUML and fell in love with it immediately. I think the main reason is that, as a developer, I tend to describe the world in a logical way, and I like to draw conditions as I would program them and explain complex things in an intuitive way.

Is UML worth the time?

Yes, even for a solo developer like me. There is always a vision in my mind of what feature I want to build and how to build it, but I still find that UML helps me clear my mind, especially when there is a debate about a feature. And I can only develop things when I have spare time (when the kids are asleep!).

Here is an example of a project I will build down the road.

A photo / video viewer for my family.

Conclusion

As you can see, generating a diagram from code is awesome, isn’t it? Please give it a try. Huge thanks to the open-source project PlantUML.

TaskQueue

A very simple but powerful utility class that I use all over. It lets us do things in parallel and still wait for the results to come back.

Code

/**
 * Promise-based TaskQueue
 */
type OnTaskDone = (queue: TaskQueue) => void;
type Task = () => Promise<any>;

class TaskQueue {
  concurrent = 1;
  tasks: Array<Task> = [];
  errors: Array<Error> = [];
  runningTask = 0;
  name = "";
  onTaskDone: OnTaskDone | null;
  log: Function;

  constructor(concurrent = 1, onTaskDone: OnTaskDone, name: string, showLog = false) {
    this.concurrent = concurrent;
    this.tasks = [];
    this.errors = [];
    this.runningTask = 0;
    this.name = name || "TaskQueue";
    this.onTaskDone = onTaskDone;

    // Log progress only when asked to
    if (showLog) {
      this.log = console.log;
    } else {
      this.log = () => {};
    }
  }

  addTask(task: Task) {
    this.tasks.push(task);
  }

  reset() {
    this.errors.length = 0;
    this.tasks.length = 0;
    this.runningTask = 0;
  }

  status() {
    if (this.tasks.length == 0 && this.errors.length == 0 && this.runningTask == 0) {
      return "finished";
    } else if (this.errors.length > 0) {
      return "failed";
    } else if (this.runningTask > 0 || this.tasks.length > 0) {
      return "running";
    }
  }

  /**
   * Tasks are started in the order they were added, up to `concurrent` at a
   * time. There is no guarantee that the previous task has finished.
   */
  execute() {
    if (this.status() == "finished" && this.onTaskDone) {
      this.onTaskDone(this);
      this.onTaskDone = null;
    }

    while (this.runningTask < this.concurrent && this.tasks.length > 0) {
      const task = this.tasks.shift()!;

      this.runningTask++;
      this.log(this.name, "remain", this.tasks.length, "running", this.runningTask);

      // Start the task without awaiting it here. Awaiting inside the loop
      // would keep further tasks from starting until this one finishes.
      void this.run(task);
    }
  }

  // Run one task, record its error if it fails, then refill the queue.
  private async run(task: Task) {
    try {
      await task();
    } catch (error) {
      this.errors.push(error as Error);
    } finally {
      this.runningTask--;
      this.execute();

      this.log(this.name, "remain", this.tasks.length, "running", this.runningTask);
      if (this.runningTask == 0 && this.tasks.length == 0 && this.onTaskDone) {
        this.onTaskDone(this);
      }
    }
  }
}
export default TaskQueue;
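
The same idea, running a limited number of promises in parallel and still waiting for all the results, can also be sketched as a standalone function that returns a promise instead of calling a callback. This is an illustrative variant, not part of the class above; `runLimited`, `Job`, and `delay` are hypothetical names:

```typescript
type Job<T> = () => Promise<T>;

// Run the jobs with at most `limit` of them in flight, and resolve with the
// results in the original order. The first rejection rejects the whole call.
async function runLimited<T>(jobs: Job<T>[], limit: number): Promise<T[]> {
  const results: T[] = new Array<T>(jobs.length);
  let next = 0;

  // Each worker keeps pulling the next unstarted job until none are left.
  async function worker(): Promise<void> {
    while (next < jobs.length) {
      const index = next++;
      results[index] = await jobs[index]();
    }
  }

  const workers: Promise<void>[] = [];
  for (let i = 0; i < Math.min(limit, jobs.length); i++) {
    workers.push(worker());
  }
  await Promise.all(workers);
  return results;
}

// Usage: three fake downloads, two at a time.
const delay = (ms: number, value: string): Job<string> => () =>
  new Promise<string>((resolve) => setTimeout(() => resolve(value), ms));

runLimited([delay(30, "a"), delay(10, "b"), delay(20, "c")], 2)
  .then((values) => console.log(values)); // [ 'a', 'b', 'c' ]
```

The class collects errors and keeps going, while this variant stops reporting at the first failure. Which behavior is better depends on whether partial results are useful to the caller.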

API Anywhere

Introduction

In my spare time I have built a fairly complex server and client structure for sharing images within my family. I also have a drawing app that relies heavily on web workers to boost performance. Since I am the solo developer for all of this, communication and documentation were never a problem, and plain JavaScript with webpack worked really well. That held until I had more and more APIs to design and more and more kinds of messages to send to the workers. How could I make this intuitive and straightforward? I stopped and thought about the problem. From my past experience at Microsoft, I know C# and reflection really well, and TypeScript decorators are the lifesaver here. After going through my past experience, I thought there must be a better way to solve this, so I decided to invent one for myself. I call it API-Anywhere. My ultimate goal is to treat an API as a function (which it really is). I will explain it in more detail.

Problems

  • API documentation (self-explanatory)
  • Async functions running somewhere else: web worker, backend server, or host app (mobile)
  • API parameters constrained by an interface
  • Full IDE integration
  • Easy to adopt
  • Minimum refactoring
  • Support partial migration
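
As a rough sketch of what "an API is just a function" can look like, the example below forwards method calls on a typed object through a message transport. It uses a `Proxy` rather than decorators to stay short, and every name in it (`remote`, `Transport`, `ApiMessage`, `ImageApi`, `describe`) is hypothetical; it is not the API-Anywhere implementation:

```typescript
// A call is described by the method name and its arguments.
interface ApiMessage {
  method: string;
  args: unknown[];
}

// Anything that can deliver a message and return the reply: postMessage to a
// worker, an HTTP request, a bridge to a host app, and so on.
type Transport = (message: ApiMessage) => Promise<unknown>;

// Hypothetical API surface; the interface is what constrains the parameters.
interface ImageApi {
  describe(width: number, height: number): Promise<string>;
}

// Build an object whose methods forward every call through the transport.
function remote<T extends object>(transport: Transport): T {
  const handler: ProxyHandler<T> = {
    get(_target, prop) {
      // Avoid looking like a thenable if the proxy is ever awaited.
      if (prop === "then") return undefined;
      return (...args: unknown[]) => transport({ method: String(prop), args: args });
    },
  };
  return new Proxy({} as T, handler);
}

// Stand-in for the worker or server side: it looks up the method by name.
const handlers: { [name: string]: (...args: any[]) => unknown } = {
  describe: (width: number, height: number) => width + "x" + height,
};
const localTransport: Transport = async (message) =>
  handlers[message.method](...message.args);

const imageApi = remote<ImageApi>(localTransport);
imageApi.describe(640, 480).then((text) => console.log(text)); // 640x480
```

The caller only sees `imageApi.describe(...)` with full type checking and IDE completion, while the transport decides where the code actually runs. Swapping the transport is what would allow partial migration, one API at a time.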

TypeScript vs JavaScript, which one is better?

I think there is nothing wrong with JavaScript, though some people think it is hard to master. Personally, I like to develop in JavaScript because when I debug, I want to see the same code at the breakpoint.

That, however, has already been compromised by Babel's runtime generators: so much code is generated to support the async/await magic.

TypeScript is just syntactic sugar for us. We still need to master how JavaScript works overall and make better use of it.

Because I decided to refactor all my code to the API Anywhere architecture, some parts of my code will move to TypeScript. I like the idea of decorators, which are similar to annotations in Java and attributes in C#.

Architecture diagram

API Anywhere design

Quan(Kevin) Li

Key Skills

  1. 16 years of working experience in the computer science industry
  2. Skill set covers server-side, client-side, and mobile programming (Android, HTML5, NodeJS, JavaScript)
  3. Strong background in building highly scalable 3D geometry processing web services
  4. Good at providing software solutions; quick learner of business logic, with good communication skills
  5. Programming languages: mastery of Java, JavaScript (TypeScript), WebGL, HTML5, C++
  6. Web development: NodeJS, JSON, web services, React, and Angular
  7. Database: expert in relational databases (MySQL, PostgreSQL)

About myself

I am Quan; most people call me Kevin. I have been working in the software development world for 16 years. Most of my daily job is to play with code, think about what is best for the product, and work closely with my teammates to help them advance their technical skills.

I built highly scalable 3D geometry web services that can handle millions of requests per day.

I prototyped and designed the Codeblocks product for Tinkercad; my role in this app was architect. Because we were a very small team, I implemented most of the code. It is a standalone application with a backend, a dashboard, a 3D view, and a blocks editor. The innovative part of this app is the animation: I show a timeline of how the model is built, including the transformations and the state transitions. I also built half of the starters to help users find a starting point. As developers, we know how hard and frustrating it is for users to start coding, so I worked hard on the blocks syntax to minimize the chance of errors. This is the app I am most proud of so far.

3D bricks space for Tinkercad. We know people love Lego bricks and like to build things with them. What if we could develop an algorithm to convert any 3D model to bricks? It was a fun project for me, since I had never played with bricks before. After I implemented the first version of the algorithm, I bought a box of bricks from the u-pick area of a Lego store, and the first thing I built was a hot air balloon. It was so cool, and I had lots of fun. It is an entirely client-side solution that runs in the user’s browser; I tried it with very large designs, and it still worked. In order to export huge STL/OBJ models, I rewrote the STL/OBJ exporter to leverage buffers in HTML5.

I have released two very successful Android applications: Pixlr Express for Android and Instructables for Android. Pixlr won Google's Best Android App of 2012. (yeah)

There are so many things I could write about, but let’s stop here.

Work Experience

Google LLC

Software Engineer (Linux Kernel Release)

2021/10 - Present

Autodesk

Software Architect, Autodesk, San Francisco, California, US

2018/04 – 2021/10

Web Graphics System, Large Model Viewer – Autodesk Forge Platform

  • High-level design and implementation of 2D graphics for the fastest PDF viewing experience
  • Coach the team on graphics technologies

Tinkercad Project

Jun 2016 - April 2018

Autodesk's main online education platform for STEM students. It provides easy-to-use online 3D design and electronic circuit tools for millions of kids worldwide. https://www.tinkercad.com

  • Responsible for the whole architecture of the Tinkercad website, web services and 3d design editor
  • Manage the Tinkercad development team to coordinate the product direction and needs
  • Review code for the whole team, coach teammates to help them grow in the company, and organize technical debates when there are tough technical decisions to make
  • Performance probing to make sure Tinkercad can handle at least 10K concurrent users; the Tinkercad user base grew from 2.6 million to about 11 million without increasing AWS cloud cost, while delivering better performance
  • Integrate with Autodesk Fusion 360 to build a funnel for students to move on to more advanced 3D/simulation tools
  • Rebuild the search engine for the website and adapt the web service infrastructure to an auto-scaling architecture, which saves hundreds of thousands of dollars per year
  • Build the Tinkercad Codeblocks editor to teach users procedural modeling on the Tinkercad platform; it won The EdTech Awards 2019
  • Architect the automation framework against Tinkercad 3D editor
  • Prototype and Release Tinkercad Part Feature (June, 2017)
  • Build tinkercad Codeblocks feature
  • Unify backend geometry services for V1 and V2 system (April - May, 2017)
  • Design and implement the V2 API system to support autoscaling without sticky sessions (April, 2017)
  • Migrate V1 system thumbnail rendering pipeline and integrate it with V2 method
  • Design and implement URL signature to enable the access control to assets (April, 2017)
  • Design and implement the API-Reader system (January - March, 2017)
  • Massive memory leak and performance fix in Tinkercad Editor (January - March, 2017)
  • Build backend CSG services to help Tinkercad centralize the geometry services and reduce the cost of AWS
  • Change application logic to build auto scalable application infrastructure to help company reduce the cost
  • Do detailed memory leak analysis and decouple GC references to fix memory leaks on the client side
  • Change the GLSL shader to create the correct visual effect
  • Add batch operation control to boost application performance
  • Create AWS Lambda functions to monitor S3 changes; split the current API server onto new infrastructure to offload server pressure without re-engineering the logic; move the server to more scalable infrastructure

Principal Engineer, Autodesk (ACG), Toronto, Canada

2016/04 – 2016/06

Senior Software Engineer, Autodesk (ACG), Toronto, Canada

2014/06 – 2016/04

123D Tinkercad Project [May 2015 – June 2016]

  • Work closely with the Tinkercad global team to build the next generation of the Tinkercad platform by redefining the backend and web application infrastructure; review and coach team members to help them grow their professional skills at Autodesk
  • Work with the Lagoa team in Montreal, Canada and the ASM team in Cambridge, UK to prototype the Tinkercad roadmap project – Project Marta, the bridge from consumer 3D applications to the professional modeling and simulation software Autodesk Fusion 360
  • Build a highly scalable CSG service by wrapping Gen6 with NodeJS and exposing it as a cloud web service with excellent performance and velocity
  • Fix memory issues in the Gen6 code and change the architecture to cache intermediate calculation results
  • Build SketchTool for the EZHome project (based on a BREP-style data structure)
  • Build the main work plane and improve the visual effects
  • Build a solid boolean server, a C++ library that needs to be wrapped for Chrome V8
  • Fix memory issues in the Tinkercad editor and speed up loading with an asynchronous loading experience
  • Build a thumbnail generation service that uses POV-Ray to render the 3D scene
  • SVG import and export with advanced 2D boolean logic
  • Build an OBJ/STL file exporter with NodeJS

123D Sculpt+ (Android) Project

[June 2014 – April 2015]

  • The Sculpt project is 3D modeling software, the upgraded version of 123D Creature. It involves many technologies, including C++, Lua, OpenGL, Java, JNI, and all kinds of reflection. I was responsible for implementing some C++ APIs for the framework and integrating the Android native UI with the OpenGL UI, which was controlled by C++ and the HUD system.
  • Build the Android UI and control the transition animations
  • Build custom user controls that provide a smooth user experience
  • Investigate the build script to make the JNI code run faster with compiler options
  • Create Lua scripts to connect with Java and other ecosystems.

Senior Software Development Engineer, Autodesk (ACG), ACRD, China

2010/09 – 2014/05

Instructables Project: Instructables for Android

[Aug 2013 – May 2014]

  • I have been the only developer for this project since Aug 2013. We completely revamped this application starting from version 1.3 and fixed lots of crashes, memory leaks, and logic issues.
  • Successfully released versions 1.2, 1.3, 1.4, 1.5, and 2.0
  • Rebuilt the whole UI, workflow, and creation logic for version 1.3
  • Built localization support for 6 languages
  • Created a day/night mode to improve the viewing experience
  • Analyzed memory leaks with MAT (Memory Analyzer Tool for Eclipse) and fixed almost all of them, which let us run more than half a million rounds of Android Monkey testing to detect memory issues and possible crashes. Our application is now more stable than ever.
  • Refined the data model for better data binding and extension, which helped me minimize data adapting and code extensions.

Pixlr Project: Pixlr Express For Android

[Jun 2012 – Aug 2013]

  • Key contributor to Pixlr Express for Android, a photo editing application that runs on the Android platform and supports SDK 2.2+. It was released in Nov 2012 and won Google Best Apps at the end of 2012. [https://play.google.com/store/apps/collection/promotion_3000068_best_apps]
  • Built a robust UI that adapts dynamically to all platforms, with fluent animations
  • Implemented Vibrance/Touch Up/Focal Blur/Color Splash/Denoise/Whiten/History Brush/Doodle
  • Performance tuning for Whiten/Rotate/Saving
  • Investigated ad platforms [not released]
  • Developed UI component for the application: Slider/Color Picker/Color Palette/Value Tile
  • Developed post animation mechanism for editing tools
  • Developed vibrance feature for Pixlr Editor(Web)
  • Codebase update for Pixlr-O-Matic
  • Collage feature for Pixlr Express (web and android)
  • Build real-time video effect for Pixlr-O-Matic2 with OpenGL and GLSL

HomeStyler Project

[Sep 2010 - Mar 2012]

  • Implemented new rules for room and wall interactions (modeling improvement), greatly improving the user experience when creating rooms and walls
  • Built a pure 3D viewer based on the Adobe Molehill API [POC]
  • Introduced a new 360 panorama to our product; this feature will be released in Oct 2011 as a main feature, and we have built viewers for all platforms, including iPhone/iPad using HTML5
  • Developed a state management system for the HomeStyler designer and a design profile for web and mobile platforms
  • Built a new content engine for the catalog and integrated it with other furniture companies
  • Built a new content service engine with CakePHP and MySQL
  • Database tuning that improved performance by more than 20 times
  • Debugged and fixed a server-side memory leak
  • Built many automation tools to make the development work easier

Chinasoft

SDETII Chinasoft Resource Co., Ltd (Shanghai & Beijing)

2007/02 – 2010/08

  • Member of the Visual Studio Core IDE team, where we maintained and implemented the Microsoft Automation User Interface framework (VB.Net)
  • Used Windows Debugger to diagnose the managed heap of the Visual Studio runtime and monitor the object count and reference count of the NewProjectDialog object; investigated memory leaks in VS; investigated problems using Maui for stress and performance tests (8-hour runs)
  • Developed daily report tool GroupRunSummary using Asp.Net, Excel, Maddog SDK, Product Studio, TFS
  • Developed batch file for system installation and increased productivity
  • Developed report generation program for efficient results management and analysis (VB. Net)
  • Developed automation test cases using C# and VB.Net to test Visual Studio, Download Management Studio, and DSP WCF Web Services; Solved focus problem of debugging test cases by using Remote Debugger Technology
  • Located keys for resource strings in managed assembly to solve the instability issues of resource strings
  • Developed online template generator tool for new features in Visual Studio 2010
  • Developed automation user interface for DMS tool based on MSAA and UIA technologies; Developed SDK Automation framework for testing DSP WCF Web Services
  • Developed test data sources for data-driven testing

Kayang

R&D Engineer Kayang Information System Co. Ltd, Shanghai, China

2006/06 – 2007/01

  • Delivered Kayang HR Workshop (based on Visual Basic 6.0, SQL Server 2000) ver. 7.0.163 to 7.0.165
  • Developed Kayang HR Data Transfer Tool (using Windows Services, XML, SQL Server) which is used to transfer data between Database, File Server and ERP System
  • Delivered a special version of Kayang HR Workshop for a bank (changed the approval workflow to use a maker-checker system)

Education

  • 2003/09-2006/06 Master in Geology and Paleontology - Chinese Academy of Sciences, China
  • 1999/09-2003/06 Bachelor in Biology - Nanjing University, China

Certificates

  • 2007/05 Microsoft Certified Applications Developer (MCAD)
  • 2007/12 Maui/Maddog Expert in Chinasoft

Appendix

Adobe Molehill API is a Flash 3D solution that leverages the power of the graphics card; it is a fast rendering system
Maui – Microsoft Automation User Interface, it is a testing framework which leverages the MSAA and UIA technology to drive user interfaces for testing purpose
Maddog – Internal test management platform in Microsoft
ACG – Autodesk Consumer Group
ACRD – Autodesk China Research & Development Center
HomeStyler is free interior design software that provides a great 2D/3D design experience for consumers; it can also provide high-definition RAAS images
Pixlr is online and mobile picture editing software that is very popular with all kinds of users and has a large user base.
Instructables is a website specializing in user-created and uploaded do-it-yourself projects, which other users can comment on and rate for quality.
123D Creature – Please check the following application on iTunes Store and Google Play
https://itunes.apple.com/ca/app/123d-creature/id594014056?mt=8
https://play.google.com/store/apps/details?id=com.autodesk.Sculpt&hl=en
CSG – Constructive solid geometry (CSG) (formerly called computational binary solid geometry) is a technique used in solid modeling. Constructive solid geometry allows a modeler to create a complex surface or object by using Boolean operators to combine simpler objects.

Storypad Introduction

Coding to learn and make stunning vector graphic images

Introduction

Build your first StoryBlocks

Learning to code is not easy; we just try to help you enjoy it

Introducing StoryBlocks

StoryBlocks is an application built on top of Google Blockly. It helps you learn to use code to do simple 2D graphic design by manipulating 2D primitives. Once you have mastered it, it is a powerful tool. During the StoryBlocks learning journey, you get the chance to understand basic computer programming concepts and gain graphic design experience.

Let’s start. What are blocks? Like bricks for building a house, blocks are what we use here to build a runnable application that creates the content you want. Here are some screenshots from the app.

Blocks Example

This is the Rectangle block, which you can use to create a rectangle. Drag it out to see all of its parameters.