Projects
Implementations, experiments, and explorations.
✦
ML & Deep Learning
Yuntun: Qwen3-0.6B with Megatron-Style Tensor Parallelism · GitHub →
PyTorch, Megatron-Style TP (Column/Row/Vocab Sharding), FineWeb Streaming, HuggingFace Parity, RoPE/GQA/RMSNorm/QK-Norm
- Built a minimal Qwen3-style causal LM from scratch and pre-trained it on FineWeb (streaming, sample-10BT) with gradient accumulation and checkpointing.
- Implemented Megatron-style tensor-parallel layers (column/row linear, vocab-parallel embedding and LM head with custom autograd), plus RoPE and GQA in the decoder.
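A minimal single-process NumPy sketch of the column/row sharding idea behind Megatron-style tensor parallelism (illustrative only, not the repo's implementation; the lists of shards stand in for per-rank work, and the concatenate/sum stand in for all-gather/all-reduce collectives):

```python
import numpy as np

def column_parallel_linear(x, weight, tp_size):
    """Column parallelism: split W column-wise across tp_size ranks.
    Each rank computes x @ W_shard; concatenating the partial outputs
    along the feature axis recovers x @ W (an all-gather in practice)."""
    shards = np.split(weight, tp_size, axis=1)      # one shard per rank
    partial = [x @ w for w in shards]               # local matmuls
    return np.concatenate(partial, axis=-1)         # simulated all-gather

def row_parallel_linear(x, weight, tp_size):
    """Row parallelism: split W row-wise, so each rank sees a slice of
    the input features; summing the partial outputs recovers x @ W
    (an all-reduce in practice)."""
    w_shards = np.split(weight, tp_size, axis=0)
    x_shards = np.split(x, tp_size, axis=-1)
    return sum(xs @ ws for xs, ws in zip(x_shards, w_shards))
```

Pairing a column-parallel layer with a following row-parallel layer is what lets an MLP or attention block run with a single all-reduce per block.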
Weigou: Minimal 4D-Parallel LLaMA Training (SmolLM-360M) · GitHub →
PyTorch, 4D Parallelism (TP/CP/PP/DP), Custom Ring Attention, Pipeline Parallelism (1F1B/AFAB), Flash Attention, SLURM
- Built a lean 4D-parallel training stack from scratch, including tensor, context, pipeline, and data parallelism with a unified process group manager over a DP×PP×CP×TP grid, plus bucketed gradient synchronization across CP+DP ranks.
- Implemented Megatron/Picotron-style tensor-parallel layers (column/row/vocab sharding), ring-attention-based context parallelism with RoPE slicing, and a pipeline engine (1F1B/AFAB) for LLaMA-like models, wired into a config/CLI + SLURM workflow for multi-node experiments.
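The unified process-group manager boils down to mapping each global rank onto a DP×PP×CP×TP grid. A sketch of that mapping, with TP as the fastest-varying axis so TP groups are made of adjacent ranks (the function names here are illustrative, not the repo's API):

```python
def grid_coords(rank, dp, pp, cp, tp):
    """Map a global rank to (dp, pp, cp, tp) coordinates on a
    DP x PP x CP x TP grid. TP varies fastest, so neighbouring ranks
    share a TP group (keeping bandwidth-hungry TP traffic intra-node)."""
    assert 0 <= rank < dp * pp * cp * tp
    tp_idx = rank % tp
    cp_idx = (rank // tp) % cp
    pp_idx = (rank // (tp * cp)) % pp
    dp_idx = rank // (tp * cp * pp)
    return dp_idx, pp_idx, cp_idx, tp_idx

def tp_group(rank, dp, pp, cp, tp):
    """All global ranks that differ from `rank` only in the TP coordinate,
    i.e. the ranks that would share a tensor-parallel process group."""
    base = rank - (rank % tp)
    return list(range(base, base + tp))
```

Each of the four parallel dimensions gets its own process group built from such coordinate slices; bucketed gradient sync then runs over the combined CP+DP groups.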
GPT-2 Speedrun: Single-Node Multi-GPU Pre-Training (DDP) · GitHub →
PyTorch, Distributed Data Parallel (DDP), torch.compile, AMP (BF16/FP16)
- Implemented an end-to-end GPT-2 (124M) pre-training stack with DDP gradient accumulation, cosine LR + warmup scheduling, checkpoint/resume, and optional initialization from HuggingFace GPT-2 weights.
- Optimized throughput via torch.compile, fused AdamW (CUDA), TF32 matmul, Flash SDP attention when available, pinned-memory non-blocking transfers, and mixed precision (BF16/FP16 w/ GradScaler).
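The cosine-LR-with-warmup schedule used in runs like this is a small pure function; a sketch with made-up default hyperparameters (the actual values live in the repo's config):

```python
import math

def lr_schedule(step, max_lr, min_lr, warmup_steps, max_steps):
    """Linear warmup from ~0 to max_lr over warmup_steps, then cosine
    decay down to min_lr at max_steps; flat at min_lr afterwards."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps     # linear warmup
    if step >= max_steps:
        return min_lr                                  # post-decay floor
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)
```

The scheduler is typically applied per optimizer step (i.e. after gradient accumulation), not per micro-batch.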
BeaconGrad · GitHub →
Python, NumPy
- Built a NumPy-based tensor automatic-differentiation (autograd) engine with broadcasting-aware backprop, neural-network modules, and optimizers; validated gradients via finite-difference gradchecks and float64 parity tests against PyTorch.
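The gradcheck idea is to compare analytic gradients against central finite differences, element by element. A minimal sketch for a scalar-valued function (illustrative; the engine's own checker handles tensors in its graph):

```python
import numpy as np

def gradcheck(f, x, analytic_grad, eps=1e-6, tol=1e-4):
    """Compare analytic_grad of scalar-valued f at x against central
    finite differences: (f(x + eps*e_i) - f(x - eps*e_i)) / (2*eps)."""
    numeric = np.zeros_like(x, dtype=np.float64)
    for idx in np.ndindex(x.shape):
        orig = x[idx]
        x[idx] = orig + eps
        f_plus = f(x)
        x[idx] = orig - eps
        f_minus = f(x)
        x[idx] = orig                      # restore the perturbed entry
        numeric[idx] = (f_plus - f_minus) / (2 * eps)
    return np.allclose(numeric, analytic_grad, atol=tol)
```

Central differences are second-order accurate, which is why they make a reliable oracle even with a modest eps; float64 keeps the subtraction from being swamped by rounding.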
Optimized YOLOv11 for Document Layout Recognition and Inference
PyTorch, YOLO, TensorRT, onnxruntime, OpenVINO
- Fine-tuned YOLOv11 on DocLayNet for document layout analysis (captions, footnotes, formulas, etc.).
- Accelerated inference via TensorRT, ONNX Runtime, and OpenVINO, achieving scalable batch processing with threaded execution.
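A sketch of the threaded batch-processing pattern (the `run_batch` callable is a hypothetical stand-in for an ONNX Runtime / TensorRT session call; such native inference calls release the GIL, so threads overlap usefully):

```python
from concurrent.futures import ThreadPoolExecutor

def infer_batches(items, run_batch, batch_size=8, workers=4):
    """Chunk `items` into fixed-size batches and run them concurrently.
    `run_batch` takes a list of inputs and returns a list of outputs;
    pool.map preserves batch order, so results line up with `items`."""
    batches = [items[i:i + batch_size]
               for i in range(0, len(items), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(run_batch, batches))
    return [y for batch_out in results for y in batch_out]  # flatten
```

For CPU-bound Python pre/post-processing a process pool would be the better fit; threads are the right choice here precisely because the heavy work happens inside the native runtime.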
Expandable Subspace Ensemble for Class-Incremental Learning · GitHub →
PyTorch, NumPy
- Implemented a subspace-expansion technique that adds capacity for new classes while preserving performance on previous ones, benchmarked from scratch on CIFAR-10.
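One core invariant in this family of methods: when new classes arrive, the classifier grows, but logits for old classes must be untouched. A toy NumPy sketch of an expandable head illustrating just that invariant (a simplification; the project's subspace ensemble is richer than a single linear head):

```python
import numpy as np

class ExpandableHead:
    """Linear classifier that grows row-wise per task. Old rows are
    frozen at expansion time, so logits for previously learned classes
    are bit-identical before and after adding new classes."""

    def __init__(self, in_dim):
        self.in_dim = in_dim
        self.weight = np.zeros((0, in_dim))   # (n_classes, in_dim)

    def expand(self, n_new, rng):
        # Append freshly initialised rows; existing rows are untouched.
        new_rows = rng.normal(scale=0.01, size=(n_new, self.in_dim))
        self.weight = np.concatenate([self.weight, new_rows], axis=0)

    def logits(self, x):
        return x @ self.weight.T              # (batch, n_classes)
```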
Generative & Probabilistic
Discrete Walk-Jump Sampling for Protein Discovery · GitHub →
PyTorch, Energy-Based Models, Langevin MCMC, Contrastive Divergence, Denoising Networks
- Implemented Discrete Walk-Jump Sampling for antibody sequence generation using EBMs trained via contrastive divergence.
- Employed Langevin MCMC for exploration and one-step denoising for refinement, optimizing sampling efficiency and sequence quality.
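The walk/jump split can be sketched in a few lines: Langevin MCMC "walks" in the noise-smoothed space, then a single denoiser call "jumps" back toward clean samples. A toy continuous NumPy version (illustrative only; the project operates on smoothed discrete antibody sequences with a learned EBM and denoiser):

```python
import numpy as np

def langevin_walk(grad_energy, y0, step=1e-2, n_steps=100, rng=None):
    """'Walk' phase: unadjusted Langevin dynamics in the smoothed space,
    y <- y - step * grad_E(y) + sqrt(2 * step) * noise."""
    if rng is None:
        rng = np.random.default_rng(0)
    y = y0.copy()
    for _ in range(n_steps):
        noise = rng.normal(size=y.shape)
        y = y - step * grad_energy(y) + np.sqrt(2 * step) * noise
    return y

def jump(denoise, y):
    """'Jump' phase: one-step denoising, i.e. an estimate of E[x | y]
    produced by the trained denoising network."""
    return denoise(y)
```

Decoupling the two phases is the point: the walk only needs to mix in the smoothed (easier) landscape, and sample quality is recovered by the jump at the end.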
Concrete Score Matching: Generalized Score Matching for Discrete Data · GitHub →
PyTorch, NumPy, Concrete Score Matching, Metropolis–Hastings
- Implemented the CSM algorithm to learn score functions in discrete spaces.
- Used Metropolis–Hastings sampling for data generation and visualized true vs. generated distributions.
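The Metropolis–Hastings step used for data generation is standard: propose a state, accept with probability min(1, p(x')/p(x)). A minimal sketch over a finite state space with a uniform (hence symmetric) proposal, which is illustrative rather than the repo's sampler:

```python
import numpy as np

def metropolis_hastings(log_prob, n_states, n_samples, rng=None):
    """MH over {0, ..., n_states-1} with a uniform symmetric proposal:
    accept x' with probability min(1, p(x') / p(x)), done in log space."""
    if rng is None:
        rng = np.random.default_rng(0)
    x = int(rng.integers(n_states))
    samples = []
    for _ in range(n_samples):
        x_prop = int(rng.integers(n_states))          # symmetric proposal
        if np.log(rng.random()) < log_prob(x_prop) - log_prob(x):
            x = x_prop                                 # accept
        samples.append(x)                              # else keep x
    return np.array(samples)
```

Because the proposal is symmetric, the Hastings correction term cancels and only the target-probability ratio remains in the acceptance test.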
✦