Projects
Implementations, experiments, and explorations.
✦
ML & Deep Learning
Yuntun: Qwen3-0.6B with Megatron-Style Tensor Parallelism · GitHub →
PyTorch, Megatron-Style TP (Column/Row/Vocab Sharding), FineWeb Streaming, HuggingFace Parity, RoPE/GQA/RMSNorm/QK-Norm
- Built a minimal Qwen3-style causal LM from scratch and pre-trained it on FineWeb (streaming, sample-10BT) with gradient accumulation and checkpointing.
- Implemented Megatron-style tensor-parallel layers (column/row linear, vocab-parallel embedding and LM head with custom autograd), plus RoPE and GQA in the decoder.
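A minimal single-process NumPy sketch of the column/row sharding idea behind Megatron-style tensor parallelism (illustrative only, not the repo's implementation; the lists of shards stand in for per-rank work, and the concatenate/sum stand in for all-gather/all-reduce collectives):

```python
import numpy as np

def column_parallel_linear(x, weight, tp_size):
    """Column parallelism: split W column-wise across tp_size ranks.
    Each rank computes x @ W_shard; concatenating the partial outputs
    along the feature axis recovers x @ W (an all-gather in practice)."""
    shards = np.split(weight, tp_size, axis=1)      # one shard per rank
    partial = [x @ w for w in shards]               # local matmuls
    return np.concatenate(partial, axis=-1)         # simulated all-gather

def row_parallel_linear(x, weight, tp_size):
    """Row parallelism: split W row-wise, so each rank sees a slice of
    the input features; summing the partial outputs recovers x @ W
    (an all-reduce in practice)."""
    w_shards = np.split(weight, tp_size, axis=0)
    x_shards = np.split(x, tp_size, axis=-1)
    return sum(xs @ ws for xs, ws in zip(x_shards, w_shards))
```

Pairing a column-parallel layer with a following row-parallel layer is what lets an MLP or attention block run with a single all-reduce per block.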
Weigou: Minimal 4D-Parallel LLaMA Training (SmolLM-360M) · GitHub →
PyTorch, 4D Parallelism (TP/CP/PP/DP), Custom Ring Attention, Pipeline Parallelism (1F1B/AFAB), Flash Attention, SLURM
- Built a lean 4D-parallel training stack from scratch, including tensor, context, pipeline, and data parallelism with a unified process group manager over a DP×PP×CP×TP grid, plus bucketed gradient synchronization across CP+DP ranks.
- Implemented Megatron/Picotron-style tensor-parallel layers (column/row/vocab sharding), ring-attention-based context parallelism with RoPE slicing, and a pipeline engine (1F1B/AFAB) for LLaMA-like models, wired into a config/CLI + SLURM workflow for multi-node experiments.
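The unified process-group manager boils down to mapping each global rank onto a DP×PP×CP×TP grid. A sketch of that mapping, with TP as the fastest-varying axis so TP groups are made of adjacent ranks (the function names here are illustrative, not the repo's API):

```python
def grid_coords(rank, dp, pp, cp, tp):
    """Map a global rank to (dp, pp, cp, tp) coordinates on a
    DP x PP x CP x TP grid. TP varies fastest, so neighbouring ranks
    share a TP group (keeping bandwidth-hungry TP traffic intra-node)."""
    assert 0 <= rank < dp * pp * cp * tp
    tp_idx = rank % tp
    cp_idx = (rank // tp) % cp
    pp_idx = (rank // (tp * cp)) % pp
    dp_idx = rank // (tp * cp * pp)
    return dp_idx, pp_idx, cp_idx, tp_idx

def tp_group(rank, dp, pp, cp, tp):
    """All global ranks that differ from `rank` only in the TP coordinate,
    i.e. the ranks that would share a tensor-parallel process group."""
    base = rank - (rank % tp)
    return list(range(base, base + tp))
```

Each of the four parallel dimensions gets its own process group built from such coordinate slices; bucketed gradient sync then runs over the combined CP+DP groups.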
GPT-2 Speedrun: Single-Node Multi-GPU Pre-Training (DDP) · GitHub →
PyTorch, Distributed Data Parallel (DDP), torch.compile, AMP (BF16/FP16)
- Implemented an end-to-end GPT-2 (124M) pre-training stack with DDP gradient accumulation, cosine LR + warmup scheduling, checkpoint/resume, and optional initialization from HuggingFace GPT-2 weights.
- Optimized throughput via torch.compile, fused AdamW (CUDA), TF32 matmul, Flash SDP attention when available, pinned-memory non-blocking transfers, and mixed precision (BF16/FP16 w/ GradScaler).
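The cosine-LR-with-warmup schedule used in runs like this is a small pure function; a sketch with made-up default hyperparameters (the actual values live in the repo's config):

```python
import math

def lr_schedule(step, max_lr, min_lr, warmup_steps, max_steps):
    """Linear warmup from ~0 to max_lr over warmup_steps, then cosine
    decay down to min_lr at max_steps; flat at min_lr afterwards."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps     # linear warmup
    if step >= max_steps:
        return min_lr                                  # post-decay floor
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)
```

The scheduler is typically applied per optimizer step (i.e. after gradient accumulation), not per micro-batch.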
BeaconGrad · GitHub →
Python, NumPy
- Built a NumPy-based tensor automatic-differentiation (autograd) engine with broadcasting-aware backprop, neural-network modules, and optimizers; validated gradients via finite-difference gradchecks and float64 parity tests against PyTorch.
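The gradcheck idea is to compare analytic gradients against central finite differences, element by element. A minimal sketch for a scalar-valued function (illustrative; the engine's own checker handles tensors in its graph):

```python
import numpy as np

def gradcheck(f, x, analytic_grad, eps=1e-6, tol=1e-4):
    """Compare analytic_grad of scalar-valued f at x against central
    finite differences: (f(x + eps*e_i) - f(x - eps*e_i)) / (2*eps)."""
    numeric = np.zeros_like(x, dtype=np.float64)
    for idx in np.ndindex(x.shape):
        orig = x[idx]
        x[idx] = orig + eps
        f_plus = f(x)
        x[idx] = orig - eps
        f_minus = f(x)
        x[idx] = orig                      # restore the perturbed entry
        numeric[idx] = (f_plus - f_minus) / (2 * eps)
    return np.allclose(numeric, analytic_grad, atol=tol)
```

Central differences are second-order accurate, which is why they make a reliable oracle even with a modest eps; float64 keeps the subtraction from being swamped by rounding.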
Optimized YOLOv11 for Document Layout Recognition and Inference
PyTorch, YOLO, TensorRT, onnxruntime, OpenVINO
- Fine-tuned YOLOv11 on DocLayNet for document layout analysis (captions, footnotes, formulas, etc.).
- Accelerated inference via TensorRT, ONNX Runtime, and OpenVINO, achieving scalable batch processing with threaded execution.
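A sketch of the threaded batch-processing pattern (the `run_batch` callable is a hypothetical stand-in for an ONNX Runtime / TensorRT session call; such native inference calls release the GIL, so threads overlap usefully):

```python
from concurrent.futures import ThreadPoolExecutor

def infer_batches(items, run_batch, batch_size=8, workers=4):
    """Chunk `items` into fixed-size batches and run them concurrently.
    `run_batch` takes a list of inputs and returns a list of outputs;
    pool.map preserves batch order, so results line up with `items`."""
    batches = [items[i:i + batch_size]
               for i in range(0, len(items), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(run_batch, batches))
    return [y for batch_out in results for y in batch_out]  # flatten
```

For CPU-bound Python pre/post-processing a process pool would be the better fit; threads are the right choice here precisely because the heavy work happens inside the native runtime.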
Expandable Subspace Ensemble for Class-Incremental Learning · GitHub →
PyTorch, NumPy
- Implemented a subspace-expansion technique that adds capacity for new classes while preserving performance on previous ones, benchmarked from scratch on CIFAR-10.
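One core invariant in this family of methods: when new classes arrive, the classifier grows, but logits for old classes must be untouched. A toy NumPy sketch of an expandable head illustrating just that invariant (a simplification; the project's subspace ensemble is richer than a single linear head):

```python
import numpy as np

class ExpandableHead:
    """Linear classifier that grows row-wise per task. Old rows are
    frozen at expansion time, so logits for previously learned classes
    are bit-identical before and after adding new classes."""

    def __init__(self, in_dim):
        self.in_dim = in_dim
        self.weight = np.zeros((0, in_dim))   # (n_classes, in_dim)

    def expand(self, n_new, rng):
        # Append freshly initialised rows; existing rows are untouched.
        new_rows = rng.normal(scale=0.01, size=(n_new, self.in_dim))
        self.weight = np.concatenate([self.weight, new_rows], axis=0)

    def logits(self, x):
        return x @ self.weight.T              # (batch, n_classes)
```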
Generative & Probabilistic
Discrete Walk-Jump Sampling for Protein Discovery · GitHub →
PyTorch, Energy-Based Models, Langevin MCMC, Contrastive Divergence, Denoising Networks
- Implemented Discrete Walk-Jump Sampling for antibody sequence generation using EBMs trained via contrastive divergence.
- Employed Langevin MCMC for exploration and one-step denoising for refinement, optimizing sampling efficiency and sequence quality.
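The walk/jump split can be sketched in a few lines: Langevin MCMC "walks" in the noise-smoothed space, then a single denoiser call "jumps" back toward clean samples. A toy continuous NumPy version (illustrative only; the project operates on smoothed discrete antibody sequences with a learned EBM and denoiser):

```python
import numpy as np

def langevin_walk(grad_energy, y0, step=1e-2, n_steps=100, rng=None):
    """'Walk' phase: unadjusted Langevin dynamics in the smoothed space,
    y <- y - step * grad_E(y) + sqrt(2 * step) * noise."""
    if rng is None:
        rng = np.random.default_rng(0)
    y = y0.copy()
    for _ in range(n_steps):
        noise = rng.normal(size=y.shape)
        y = y - step * grad_energy(y) + np.sqrt(2 * step) * noise
    return y

def jump(denoise, y):
    """'Jump' phase: one-step denoising, i.e. an estimate of E[x | y]
    produced by the trained denoising network."""
    return denoise(y)
```

Decoupling the two phases is the point: the walk only needs to mix in the smoothed (easier) landscape, and sample quality is recovered by the jump at the end.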
Concrete Score Matching: Generalized Score Matching for Discrete Data · GitHub →
PyTorch, NumPy, Concrete Score Matching, Metropolis–Hastings
- Implemented the CSM algorithm to learn score functions in discrete spaces.
- Used Metropolis–Hastings sampling for data generation and visualized true vs. generated distributions.
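The Metropolis–Hastings step used for data generation is standard: propose a state, accept with probability min(1, p(x')/p(x)). A minimal sketch over a finite state space with a uniform (hence symmetric) proposal, which is illustrative rather than the repo's sampler:

```python
import numpy as np

def metropolis_hastings(log_prob, n_states, n_samples, rng=None):
    """MH over {0, ..., n_states-1} with a uniform symmetric proposal:
    accept x' with probability min(1, p(x') / p(x)), done in log space."""
    if rng is None:
        rng = np.random.default_rng(0)
    x = int(rng.integers(n_states))
    samples = []
    for _ in range(n_samples):
        x_prop = int(rng.integers(n_states))          # symmetric proposal
        if np.log(rng.random()) < log_prob(x_prop) - log_prob(x):
            x = x_prop                                 # accept
        samples.append(x)                              # else keep x
    return np.array(samples)
```

Because the proposal is symmetric, the Hastings correction term cancels and only the target-probability ratio remains in the acceptance test.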
✦