Cheng-Yen (Wesley) Hsieh

I am a research scientist at ByteDance Research, based in San Jose.

My research covers machine learning and computer vision. My current research focuses on advancing AI for scientific discovery. Specifically, I develop large-scale multi-modal diffusion language models to tackle protein modeling. My earlier work also explored foundational research areas, including self-supervised learning, amodal object tracking, federated learning, and vision language models.

I was advised by Prof. Deva Ramanan during my Master of Science in Computer Vision at Carnegie Mellon University. I received my B.S. from National Taiwan University, where I had the pleasure of working with Prof. Yu-Chiang Frank Wang and Prof. An-Yeu (Andy) Wu.

Github / Google Scholar / chengyenhsieh0806@gmail.com

news

May 1, 2025	DPLM-2.1 is accepted as a Spotlight at ICML 2025!
Apr 16, 2024	Launched the official page of our DPLM series.
Mar 4, 2024	Joined ByteDance as an AI research scientist.
May 29, 2023	Joined Waymo as a Machine Learning Engineer Intern.
Aug 22, 2022	Joined CMU RI as a master's student in computer vision.

selected publications

Elucidating the Design Space of Multimodal Protein Language Models

Cheng-Yen Hsieh, Xinyou Wang, Daiheng Zhang, Dongyu Xue, Fei Ye, Shujian Huang, Zaixiang Zheng, and Quanquan Gu

ICML, 2025 (Spotlight, Top 2.6% of submissions)

Design choices are essential: Our designs enable the 650M multimodal PLM to outperform 3B-scale baselines and specialized structure folding models.

Multimodal protein language models (PLMs) integrate sequence and token-based structural information, serving as a powerful foundation for protein modeling, generation, and design. However, the reliance on tokenizing 3D structures into discrete tokens causes a substantial loss of fidelity in fine-grained structural details and correlations. In this paper, we systematically elucidate the design space of multimodal PLMs to overcome their limitations. We identify tokenization loss and inaccurate structure token predictions by the PLMs as major bottlenecks. To address these, our proposed design space covers improved generative modeling, structure-aware architectures and representation learning, and data exploration. Our advancements approach finer-grained supervision, demonstrating that token-based multimodal PLMs can achieve robust structural modeling. These design methods dramatically improve structure generation diversity and, notably, the folding ability of our 650M model, reducing the RMSD from 5.52 to 2.36 on the PDB test set, even outperforming 3B baselines and performing on par with specialized folding models.

TAO-Amodal: A Benchmark for Tracking Any Object Amodally

Cheng-Yen Hsieh, Kaihua Chen, Achal Dave, Tarasha Khurana, and Deva Ramanan

arXiv preprint, Nov 2023

Our solution to unravel occlusion scenarios for any object: amodal tracking.

Amodal perception, the ability to comprehend complete object structures from partial visibility, is a fundamental skill, even for infants. Its significance extends to applications like autonomous driving, where a clear understanding of heavily occluded objects is essential. However, modern detection and tracking algorithms often overlook this critical capability, perhaps due to the prevalence of modal annotations in most benchmarks. To address the scarcity of amodal benchmarks, we introduce TAO-Amodal, featuring 833 diverse categories in thousands of video sequences. Our dataset includes amodal and modal bounding boxes for visible and partially or fully occluded objects, including those that are partially out of the camera frame. We investigate the current lay of the land in both amodal tracking and detection by benchmarking state-of-the-art modal trackers and amodal segmentation methods. We find that existing methods, even when adapted for amodal tracking, struggle to detect and track objects under heavy occlusion. To mitigate this, we explore simple finetuning schemes that can increase the amodal tracking and detection metrics of occluded objects by 2.1% and 3.3%.

Self-Supervised Pyramid Representation Learning for Multi-Label Visual Analysis and Beyond

Cheng-Yen Hsieh, Chih-Jung Chang, Fu-En Yang, and Yu-Chiang Frank Wang

IEEE WACV, 2023

With this pre-training algorithm, one can easily adapt and fine-tune models for a variety of applications, including multi-label classification, object detection, and instance segmentation.

While self-supervised learning has been shown to benefit a number of vision tasks, existing techniques mainly focus on image-level manipulation, which may not generalize well to downstream tasks at patch or pixel levels. Moreover, existing SSL methods might not sufficiently describe and associate the above representations within and across image scales. In this paper, we propose a Self-Supervised Pyramid Representation Learning (SS-PRL) framework. The proposed SS-PRL is designed to derive pyramid representations at patch levels via learning proper prototypes, with additional learners to observe and relate inherent semantic information within an image. In particular, we present a cross-scale patch-level correlation learning in SS-PRL, which allows the model to aggregate and associate information learned across patch scales. We show that, with our proposed SS-PRL for model pre-training, one can easily adapt and fine-tune the models for a variety of applications including multi-label classification, object detection, and instance segmentation.

C3-SL: Circular Convolution-Based Batch-Wise Compression for Communication-Efficient Split Learning

Cheng-Yen Hsieh, Yu-Chuan Chuang, and An-Yeu (Andy) Wu

IEEE 32nd International Workshop on Machine Learning for Signal Processing (MLSP), 2022

Split learning (SL) for efficient image recognition through batch-wise compression.

Most existing studies improve the efficiency of split learning (SL) by compressing the transmitted features. However, most works focus on dimension-wise compression that transforms high-dimensional features into a low-dimensional space. In this paper, we propose circular convolution-based batch-wise compression for SL (C3-SL) to compress multiple features into one single feature. To avoid information loss while merging multiple features, we exploit the quasi-orthogonality of features in high-dimensional space with circular convolution and superposition. To the best of our knowledge, we are the first to explore the potential of batch-wise compression under the SL scenario. Based on the simulation results on CIFAR-10 and CIFAR-100, our method achieves a 16x compression ratio with negligible accuracy drops compared with vanilla SL. Moreover, C3-SL reduces memory overhead by 1152x and computation overhead by 2.25x compared to the state-of-the-art dimension-wise compression method.
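The idea of merging several features into one with circular convolution and superposition can be illustrated with a small NumPy sketch. This is not the C3-SL implementation: the random Gaussian keys, the decoding by circular correlation, and the sizes `num_features` and `dim` are assumptions chosen for illustration, in the style of holographic reduced representations.

```python
import numpy as np

def circular_conv(a, b):
    """Circular convolution along the last axis, computed via the FFT."""
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=a.shape[-1])

def circular_corr(a, b):
    """Circular correlation of a with b; approximately undoes binding with b."""
    return np.fft.irfft(np.fft.rfft(a) * np.conj(np.fft.rfft(b)), n=a.shape[-1])

def compress(features, keys):
    """Bind each feature to its key, then superpose them into one vector."""
    return circular_conv(features, keys).sum(axis=0)

def decompress(compressed, keys):
    """Recover a noisy estimate of every feature from the single vector."""
    return circular_corr(compressed[None, :], keys)

def cosine_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Hypothetical sizes: 4 features of dimension 4096 merged into one vector.
rng = np.random.default_rng(0)
num_features, dim = 4, 4096
features = rng.standard_normal((num_features, dim))
# Assumption: random Gaussian keys, which are nearly orthogonal in high dimension.
keys = rng.standard_normal((num_features, dim)) / np.sqrt(dim)

compressed = compress(features, keys)   # shape (dim,)
decoded = decompress(compressed, keys)  # shape (num_features, dim)

similarity = cosine_matrix(decoded, features)
print(similarity.argmax(axis=1))  # index of the best-matching original per decoded row
```

In this toy setting the recovery is approximate: each decoded row is closest to its own original, but it carries crosstalk noise that grows with the number of features merged.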

FL-HDC: Hyperdimensional Computing Design for the Application of Federated Learning

Cheng-Yen Hsieh, Yu-Chuan Chuang, and An-Yeu (Andy) Wu

IEEE 3rd International Conference on Artificial Intelligence Circuits and Systems (AICAS), 2021

Highly efficient image recognition under the federated learning (FL) scenario.

Federated learning (FL) is a privacy-preserving learning framework, which collaboratively learns a centralized model across edge devices. Each device trains an independent model with its local dataset and only uploads model parameters to mitigate privacy concerns. However, most FL works focus on deep neural networks (DNNs), whose intensive computation hinders FL from practical realization on resource-limited edge devices. In this paper, we exploit the high energy efficiency of hyperdimensional computing (HDC) to propose a federated learning HDC (FL-HDC). In FL-HDC, we bipolarize model parameters to significantly reduce communication costs, which is a primary concern in FL. Moreover, we propose a retraining mechanism with adaptive learning rates to compensate for the accuracy degradation caused by bipolarization. Under the FL scenario, our simulation results show the effectiveness of our proposed FL-HDC across two datasets, MNIST and ISOLET. Compared with the previous work that transmits complete model parameters to the cloud, FL-HDC reduces communication costs by 23x and 9x with comparable accuracy on ISOLET and MNIST, respectively.
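As a rough illustration of why bipolarization shrinks uploads, the sketch below maps each parameter to +1 or -1, packs the signs into bits, and sums the unpacked client models on the server. It is not the FL-HDC implementation: the model shapes, the bit packing, and aggregation by plain summation are assumptions, and the retraining mechanism is omitted.

```python
import numpy as np

def bipolarize(params):
    """Map every parameter to +1 or -1 by its sign (zeros map to +1)."""
    return np.where(params >= 0, 1, -1).astype(np.int8)

def pack(bipolar):
    """Encode a bipolar array as bits (1 for +1, 0 for -1) for transmission."""
    return np.packbits(bipolar > 0, axis=-1)

def unpack(packed, dim):
    """Recover the bipolar array from its packed bit representation."""
    bits = np.unpackbits(packed, axis=-1, count=dim)
    return bits.astype(np.int8) * 2 - 1

# Hypothetical setup: 3 clients, each holding 10 class hypervectors of dimension 1000.
rng = np.random.default_rng(0)
num_clients, num_classes, dim = 3, 10, 1000
local_models = rng.standard_normal((num_clients, num_classes, dim)).astype(np.float32)

# Each client uploads only the packed signs of its model.
uploads = [pack(bipolarize(model)) for model in local_models]

# Assumption: the server aggregates by summing the unpacked bipolar models.
global_model = sum(unpack(upload, dim).astype(np.int32) for upload in uploads)

print(local_models[0].nbytes, uploads[0].nbytes)  # bytes per upload before and after
```

The byte counts printed here describe only this toy float32 model and are not the communication savings reported in the paper.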