I’m a Computer Science Master’s student at the University of Pennsylvania, working on Large Language Models (LLMs), Vision-Language Models (VLMs), and NLP applications in AI for Science. [Note] I am actively applying for Fall 2026 CS Ph.D. programs.

My academic interests and journey have been driven by a persistent question: how and why do complex systems work?

I have explored this across systems (OS internals with Prof. Xu Liu, NCSU), HCI/Visualization (human cognition with Prof. Ying Zhao, CSU), and IoT/Crowdsourcing (collective intelligence with Prof. Anfeng Liu, CSU). This consistent search for an underlying method, whether in machines or the human mind, has ultimately led me to natural language processing (NLP).

My commitment to NLP was solidified during my senior year, after leading multiple research projects and founding my AI startup. I found in language models the ultimate synthesis of my interests: complex systems grounded in both computational structure and human cognition. I was captivated not just by their capabilities, but by the intellectual challenge of looking under the hood and adapting them to solve real-world problems. This passion fuels my core research ambition: to move beyond treating LLMs as black boxes and instead enhance them through principled, effective, efficient, and explainable methods.

At Penn, I’m fortunate to be advised by Prof. Chris Callison-Burch, Prof. Lyle Ungar, and Delip Rao. I also collaborate with Dr. Xiaodong Yu from AMD GenAI and Prof. Yunhuai Liu from Peking University.

My research centers on advancing LLMs and Multimodal LLMs through Effective, Efficient, and Explainable methods. I currently focus on:

  • Unlocking LLMs’ Internal Mechanisms: Designing training-free and inference-time optimization methods grounded in attention patterns, activations, representations, token logits, and prompting mechanisms. I’m particularly interested in making models more interpretable while improving their performance (Where + What + How + Why).
  • Pushing Application Boundaries: Building impactful systems in security, code understanding, and scientific automation, with measurable real-world outcomes. I believe in creating practical solutions that address open-ended and unexplored real-world challenges.
  • Advancing Model Evolution: Developing data synthesis and curation pipelines to overcome annotation and data-collection bottlenecks, and exploring post-training optimization (SFT, RL) and distillation to make smaller models competitive (data, training, distillation, pruning).

I am also the co-founder of Savable Koupon AI, where we build AI-driven price tracking, LLM-based product analysis, and recommendation systems for e-commerce.

All NLP work listed below was completed in 2024-2025. You can find my publications on Google Scholar.

❤️ Future Research Directions

In addition to continuing my current research interests, I am also eager to explore several new directions.

1. Fundamental Model Enhancement

First, while inference-time adaptations are effective, I believe that scaling or optimizing models during training will eventually surpass these approaches. As resources allow, I plan to shift my focus from inference-time tweaks to optimizing models during training.

Second, I aim to use interpretability not only to explain model behavior but also to improve training processes. For instance, insights from the attention-sink mechanism (since 2022) have led to advancements in KV-cache optimization, quantization-aware training, and extensions to VLMs. I intend to develop explainable methods in the following areas: (1) Understanding how information flows within the model—e.g., optimizing layer and head interactions; (2) Understanding how token generation works—e.g., introducing interpretable decoding control; (3) Understanding how reasoning functions—e.g., enabling smaller models to compete with larger ones and orchestrating efficient interactions between reasoning and non-reasoning components.

Finally, as LLM research has advanced more rapidly than multimodal research, I am also particularly interested in Multimodal LLMs. This includes identifying and addressing limitations in current MLLM architectures, and developing more effective and efficient methods for processing multimodal information, especially in addressing challenges like visual redundancy and modality alignment.

2. AI for Scientific Discovery

The next frontier is applying (M)LLMs to scientific discovery and applications. I plan to focus on: (1) discovering valuable new application areas as LLM capabilities continue to expand; (2) adapting and optimizing models for specific scientific domains; and (3) tackling problems from multiple perspectives: (a) unknown problems, which call for building new benchmarks; and (b) known problems, which fall into three cases: simple evaluation but challenging solutions (developing effective methods), high-cost evaluation (developing efficient methods), and easy solutions but complex evaluation requirements (e.g., designing rewards for RLVR).

🔥 News

  • November 2025:  🎉 Two papers accepted to AAAI 2026 - “LaTeX2Layout: High-Fidelity, Scalable Document Layout Annotation Pipeline for Layout Detection” and “Beyond Detection: A Comprehensive Benchmark and Study on Representation Learning for Fine-Grained Webshell Family Classification”
  • July 2025:  🎉 Paper accepted to COLM 2025 - “Can LLMs handle WebShell detection? Overcoming Detection Challenges with Behavioral Function-Aware Framework”
  • June 2025:  🎉 Paper accepted to MOSS@ICML2025 - “ZeroTuning: Unlocking the Initial Token’s Power to Enhance Large Language Models Without Training”

📝 Selected Publications

For a complete list of publications, please visit my Google Scholar profile.

🔮 Research Interest 1: Uncovering NLP & LLM Internal Mechanisms and Interpretability

MOSS@ICML2025
ZeroTuning Overview

ZeroTuning: Unlocking the Initial Token’s Power to Enhance Large Language Models Without Training

Feijiang Han, Xiaodong Yu, Jianheng Tang, Delip Rao, Weihua Du, Lyle Ungar

Paper | Code & Demo | Blog | Poster

Key Points:

  • Novel training-free optimization via initial token attention steering, supporting both supervised and unsupervised calibrations
  • Lightweight implementation (a four-line modification to the standard LlamaAttention code) achieves substantial relative gains with Llama-3.1-8B: 19.9% on classification, 4.5% on QA, and 2.1% on multi-turn dialogue
  • Explains why this method works through: (1) theoretical analysis; (2) output entropy and accuracy analysis; (3) error pattern analysis; (4) fine-grained layer/head analysis

Abstract:

Token-level attention tuning, a class of training-free methods including Post-hoc Attention Steering (PASTA, AutoPASTA) and Attention Calibration (ACT), has emerged as a promising way to improve frozen LLMs with interpretable interventions. However, these methods depend on auxiliary heuristics to identify "important" task-specific tokens, which can introduce bias and limit applicability when token importance is unclear or when using optimized kernels where attention maps are inaccessible. We propose a simpler and more elegant alternative: acting only on the initial token (e.g., <BOS> in LLaMA). We show theoretically that adding lightweight biases to this token's attention logits monotonically controls the entropy of the downstream attention distribution, an effect amplified by its natural function as an attention sink. Our empirical analysis reveals that this tuning process can positively affect LLMs and better unlock their pretrained knowledge, with stronger effects in early layers and distinct scaling preferences across attention heads. Building on these insights, we introduce ZeroTuning: a training-free method that improves LLM performance by applying head-specific attention adjustments to the initial token, requiring zero parameter updates. We present two variants: a supervised mode that calibrates on validation examples, and a novel unsupervised mode that directly minimizes the model's output entropy. Our method requires no KV-cache or decoding changes, and is kernel-agnostic (works with SDPA and FlashAttention). The method is lightweight and requires only four lines of modification to the standard LlamaAttention code. It achieves broad gains across 15 datasets and outperforms previous, more complex methods; for instance, with Llama-3.1-8B, it yields relative improvements of 19.9% on classification, 4.5% on question answering, and 2.1% on dialogue. ZeroTuning also works out-of-the-box with quantized inference and maintains its performance improvements with increasing context lengths. Our code and runnable demo are available at https://anonymous.4open.science/r/ZeroTuning.
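
The effect of biasing the initial token's attention logit can be seen on toy numbers. The sketch below is only an illustration under stated assumptions, not the four-line LlamaAttention change from the paper: the logits are made up, a single query in a single head is considered, and the helper names (`softmax`, `entropy`, `bias_initial_token`) are hypothetical. It adds a bias to position 0 only, renormalizes, and reports how much attention mass the initial token receives and how the entropy of the attention weights changes.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def bias_initial_token(logits, bias):
    """Return a copy of the logits with `bias` added to position 0 only."""
    return [logits[0] + bias] + list(logits[1:])

# Made-up attention logits for one query; position 0 plays the attention sink.
toy_logits = [2.0, 0.5, 0.1, -0.3]

for bias in (-1.0, 0.0, 1.0):
    weights = softmax(bias_initial_token(toy_logits, bias))
    print(f"bias={bias:+.1f}  initial-token mass={weights[0]:.3f}  "
          f"entropy={entropy(weights):.3f}")
```

In this toy setting the initial token holds the largest weight throughout, so raising the bias moves mass onto it and lowers the entropy, while the ratios among the remaining tokens stay unchanged because they share the same normalizer.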
Arxiv
SSR+

Read Before You Think: Mitigating LLM Comprehension Failures with Step-by-Step Reading

Feijiang Han, Hengtao Cui, Licheng Guo, Zelong Wang, Zhiyuan Lyu

Paper | Blog

Key Points:

  • Identified semantic misunderstanding as the core bottleneck in LLM reasoning, one that persists even with strong methods like CoT
  • Designed the SSR series to resolve this issue by: (1) applying step-by-step reading logic (SSR), (2) enforcing attention on key tokens via self-reference (SSR+), and (3) resolving backward dependencies through iterative re-contextualization (SSR++)

Abstract:

Large Language Models (LLMs) often fail on complex reasoning tasks due to flawed question comprehension, not just flawed logic. This paper presents a systematic investigation into these comprehension failures. Our work yields three key insights: (1) the step-by-step principle, effective for calculation, can be migrated to the reading process to enhance comprehension; (2) increasing the proportion of question-related tokens (e.g., via repetition) succeeds by refocusing attention, a mechanism that can be explicitly controlled; and (3) backward dependencies represent a core bottleneck for decoder-only models that persists even with strong methods like Chain-of-Thought. Based on these findings, we introduce the Step-by-Step Reading (SSR) family of prompts. This multi-stage approach culminates in SSR++, a method specifically engineered to deepen model comprehension by guiding it to parse questions with finer granularity, focus attention on critical tokens, and resolve backward dependencies through iterative re-contextualization. SSR++ sets a new state-of-the-art on multiple reasoning benchmarks, and our analysis confirms it works by directly mitigating semantic misunderstanding. These results demonstrate that guiding how a model reads is a powerful and efficient method for improving its reasoning ability.

🔍 Research Interest 2: Domain-Adapted Language Models for Code, Document, and Scientific Automation

COLM 2025
WebShell Detection Framework

Can LLMs handle WebShell detection? Overcoming Detection Challenges with Behavioral Function-Aware Framework

Feijiang Han, Jiaming Zhang, Chuyi Deng, Jianheng Tang, Yunhuai Liu

Paper | Blog | Poster

Key Points:

  • First comprehensive study of LLMs’ capabilities in WebShell detection
  • Novel BFAD framework raises the average F1 score of LLMs by 13.82% through function-aware analysis
  • Enables larger LLMs to outperform traditional SOTA methods and smaller LLMs to become competitive with them

Abstract:

WebShell attacks, where malicious scripts are injected into web servers, pose a significant cybersecurity threat. Traditional machine learning and deep learning methods are often hampered by challenges such as the need for extensive training data, catastrophic forgetting, and poor generalization. Recently, Large Language Models (LLMs) have emerged as a powerful alternative for code-related tasks, but their potential in WebShell detection remains underexplored. In this paper, we make two major contributions: (1) a comprehensive evaluation of seven LLMs, including GPT-4, LLaMA 3.1 70B, and Qwen 2.5 variants, benchmarked against traditional sequence- and graph-based methods using a dataset of 26.59K PHP scripts, and (2) the Behavioral Function-Aware Detection (BFAD) framework, designed to address the specific challenges of applying LLMs to this domain. Our framework integrates three components: a Critical Function Filter that isolates malicious PHP function calls, a Context-Aware Code Extraction strategy that captures the most behaviorally indicative code segments, and Weighted Behavioral Function Profiling (WBFP) that enhances in-context learning by prioritizing the most relevant demonstrations based on discriminative function-level profiles. Our results show that, stemming from their distinct analytical strategies, larger LLMs achieve near-perfect precision but lower recall, while smaller models exhibit the opposite trade-off. However, all baseline models lag behind previous state-of-the-art (SOTA) methods. With the application of BFAD, the performance of all LLMs improves significantly, yielding an average F1 score increase of 13.82%. Notably, larger models like GPT-4, LLaMA-3.1-70B, and Qwen-2.5-Coder-14B now outperform SOTA benchmarks, while smaller models such as Qwen-2.5-Coder-3B achieve performance competitive with traditional methods. This work is the first to explore the feasibility and limitations of LLMs for WebShell detection and provides solutions to address the challenges in this task.
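
To make the idea of a filter that isolates suspicious PHP function calls more concrete, here is a minimal lexical sketch. It is not the BFAD Critical Function Filter itself: the watch list `CRITICAL_FUNCTIONS`, the regular expression, and the function name `critical_calls` are all assumptions chosen for illustration, and a purely lexical scan like this ignores comments, string literals, and dynamically constructed function names.

```python
import re

# Hypothetical watch list; the actual critical-function set used by BFAD is not given here.
CRITICAL_FUNCTIONS = {
    "eval", "assert", "system", "exec", "shell_exec", "passthru", "base64_decode",
}

# An identifier followed by an opening parenthesis, i.e. something that looks like a call.
CALL_PATTERN = re.compile(r"\b([A-Za-z_][A-Za-z0-9_]*)\s*\(")

def critical_calls(php_source, watch_list=CRITICAL_FUNCTIONS):
    """Return watched function names in order of first appearance."""
    seen = []
    for match in CALL_PATTERN.finditer(php_source):
        name = match.group(1).lower()  # PHP function names are case-insensitive
        if name in watch_list and name not in seen:
            seen.append(name)
    return seen

sample = '<?php $c = base64_decode($_POST["x"]); @eval($c); echo strlen($c); ?>'
print(critical_calls(sample))
```

A filter of this kind could be used to decide which code regions are worth passing to an LLM, so that long scripts are reduced to the segments around the flagged calls.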
AAAI 2026
LaTeX2Layout Pipeline

LaTeX2Layout: High-Fidelity, Scalable Document Layout Annotation Pipeline for Layout Detection

Feijiang Han, Zelong Wang, Bowen Wang, Xinxin Liu, Skyler Cheung, Delip Rao, Chris Callison-Burch, Lyle Ungar

[Paper] | [Code & Dataset] (Coming Soon)

Key Points:

  • Novel pipeline extracting PDF layout information directly from LaTeX compilation (no human annotation or PDF parsers required)
  • Custom LaTeX packages for precise element tracking and accurate layout extraction
  • Nearly 200% relative improvement over zero-shot baselines through curriculum learning and synthetic data augmentation

Abstract:

General-purpose Vision-Language Models (VLMs) are increasingly integral to modern AI systems for document understanding, yet their ability to perform fine-grained layout analysis remains severely underdeveloped. Overcoming this requires a large-scale, high-fidelity training dataset. However, current annotation methods, which rely on parsing rendered PDFs, are costly, error-prone, and fail to scale effectively. This work introduces a paradigm shift in data acquisition to resolve this bottleneck. We present LaTeX2Layout, a novel and generalizable procedural pipeline that obtains ground-truth layout information not from the final PDF, but directly from the LaTeX compilation process itself. By instrumenting the compiler, our method produces pixel-perfect bounding boxes and reading order, entirely bypassing the ambiguities of post-rendering parsers. This efficient and accurate pipeline enables us to generate a massive dataset of 140K pages, including 120K programmatically generated variants that more than double the layout diversity of real-world datasets. This unique dataset allows us to fine-tune a highly efficient 3B parameter VLM, employing a curriculum learning strategy that re-ranks training examples from simple to complex layouts to optimize convergence. Our model establishes a new state-of-the-art, achieving a Kendall's Tau of 0.95 for reading order and a mAP@0.5 of 0.91 for element grounding, a nearly 200% relative improvement over formidable zero-shot baselines like GPT-4o and Claude-3.7.
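
For readers unfamiliar with the reading-order metric, the sketch below computes the textbook Kendall's tau (tau-a, no ties) between a predicted and a reference ordering of the same page elements. This is an assumption about the general form of the metric only: the paper may use a different variant or aggregation, and the element labels and the function name `kendall_tau` are hypothetical.

```python
from itertools import combinations

def kendall_tau(predicted_order, reference_order):
    """Kendall's tau-a between two orderings of the same items.

    Assumes both lists contain the same unique items and at least two of them.
    """
    ref_rank = {item: i for i, item in enumerate(reference_order)}
    ranks = [ref_rank[item] for item in predicted_order]
    concordant = discordant = 0
    for i, j in combinations(range(len(ranks)), 2):
        if ranks[i] < ranks[j]:
            concordant += 1  # this pair appears in the same relative order
        else:
            discordant += 1  # this pair is swapped relative to the reference
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical page elements; the prediction swaps one adjacent pair.
reference = ["title", "abstract", "figure", "section1", "section2"]
predicted = ["title", "abstract", "section1", "figure", "section2"]
print(kendall_tau(predicted, reference))
```

With five elements there are ten pairs; a single adjacent swap makes one pair discordant, so the score drops from 1.0 to (9 - 1) / 10 = 0.8.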
AAAI 2026
WebShell Family Classification

Beyond Detection: A Comprehensive Benchmark and Study on Representation Learning for Fine-Grained Webshell Family Classification

Feijiang Han

[Paper] (Coming Soon)

Key Points:

  • First systematic study automating WebShell family classification through representation learning
  • Novel dynamic function call trace extraction and LLM-based synthetic trace generation for behavioral analysis
  • Comprehensive evaluation of representation methods (sequence, graph, and tree-based models) across multiple datasets with practical insights for optimal model selection

Abstract:

Malicious WebShells represent a severe and evolving threat, compromising critical digital infrastructures and endangering public services in sectors such as healthcare and finance. While the research community has achieved considerable success in WebShell detection (distinguishing malicious from benign samples), we argue it is time to advance from passive detection to a new stage of in-depth analysis and proactive defense. A promising and critical direction is the automation of WebShell family classification: identifying the specific malware lineage to understand an adversary's tactics and enable a precise, rapid response. This crucial task, however, remains a largely unexplored area that currently relies on slow, manual expert analysis. To address this gap, we present the first systematic study to automate WebShell family classification. Our method begins with extracting dynamic function call traces to capture inherent behaviors that are resistant to common encryption and obfuscation. To enhance the scale and diversity of our dataset for a more stable evaluation, we augment these real-world traces with new variants synthesized by a Large Language Model (LLM). These augmented traces are then abstracted into sequences, graphs, and trees, providing a foundation to benchmark a comprehensive suite of representation methods. Our evaluation spans classic sequence-based embeddings (CBOW, GloVe), transformers (BERT, SimCSE), and a range of structure-aware algorithms, including Graph Kernels, Graph Edit Distance, Graph2Vec, and various Graph Neural Networks.
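
As a rough illustration of how a function call trace might be abstracted into a graph, the sketch below counts directed transitions between consecutive calls. This is one plausible construction, not necessarily the one used in the paper; the trace contents and the name `trace_to_graph` are hypothetical.

```python
from collections import Counter

def trace_to_graph(trace):
    """Map a flat call trace to weighted directed edges (caller-order transitions)."""
    # zip pairs each call with its successor; Counter tallies repeated transitions.
    return Counter(zip(trace, trace[1:]))

# Made-up dynamic trace of PHP function calls.
trace = ["base64_decode", "gzinflate", "eval", "base64_decode", "gzinflate", "system"]
graph = trace_to_graph(trace)

for (src, dst), weight in graph.items():
    print(f"{src} -> {dst} (x{weight})")
```

The same trace can be kept as a plain sequence for sequence-based embeddings, while the weighted edge list is the kind of input that graph kernels or graph neural networks typically consume.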

🌟 Research Interest 3: Other Topics (HCI, Big Data Visualization, IoT, Federated and Continual Learning)

Information Sciences 2023
CQL-MAB Overview

Credit and quality intelligent learning based multi-armed bandit scheme for unknown worker selection in multimedia MCS

Jianheng Tang, Feijiang Han, Kejia Fan, et al.

Key Points:

  • Novel Credit and Quality Learning based Multi-Armed Bandit (CQL-MAB) scheme for solving the Post-Unknown Worker Recruitment problem in MCS
  • Integrates credit identification and quality calculation for worker selection
  • Theoretically proven truthfulness and efficiency in reverse auction settings

Abstract:

The field of intelligent multimedia systems, which rely heavily on multimodal models trained on large amounts of high-quality data, has been revolutionized by the use of deep learning. One promising approach to collect such multimodal data is Mobile Crowd Sensing (MCS). However, MCS platforms face a significant challenge in selecting both high-credit and high-quality workers at low cost due to the Post-Unknown Worker Recruitment (PUWR) problem. The PUWR problem makes it difficult to determine the credits and qualities of workers in advance, which can lead to the recruitment of dishonest or low-quality workers. This problem severely affects the quality and quantity of MCS data collection, posing a serious threat to the security and robustness of large-scale multimedia models. To address this issue, we propose a Credit and Quality Learning based Multi-Armed Bandit (CQL-MAB) scheme, which consists of a novel credit identification algorithm, a fine-grained worker quality calculation method, and a two-stage reward-based Multi-Armed Bandit (MAB) for worker selection in reverse auction. The theoretical proof shows that the CQL-MAB scheme achieves the truthfulness, individual rationality, and efficiency of the auction mechanism. A large number of simulation experiments on real data traces are conducted to demonstrate the outstanding performance of CQL-MAB.
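
The bandit view of worker selection can be sketched with the standard UCB1 rule. This is deliberately a simpler, swapped-in technique: it is not CQL-MAB, and it omits credit identification, the two-stage reward, and the reverse auction. The worker qualities, the Bernoulli reward model, and the names `ucb1_select` and `run` are assumptions made for illustration.

```python
import math
import random

def ucb1_select(counts, means, t):
    """Pick the arm with the highest UCB1 index; untried arms go first."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    return max(
        range(len(counts)),
        key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]),
    )

def run(true_quality, rounds, seed=0):
    """Simulate selecting one worker per round with Bernoulli quality feedback."""
    rng = random.Random(seed)
    k = len(true_quality)
    counts, means = [0] * k, [0.0] * k
    for t in range(1, rounds + 1):
        arm = ucb1_select(counts, means, t)
        reward = 1.0 if rng.random() < true_quality[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean update
    return counts, means

# Hypothetical workers whose quality is unknown to the platform in advance.
counts, means = run([0.3, 0.5, 0.8], rounds=2000)
print(counts, [round(m, 2) for m in means])
```

The exploration bonus shrinks as a worker is selected more often, so the platform keeps sampling uncertain workers early on and gradually concentrates on those with the best observed quality.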

🎖 Honors and Awards

  • 2024 Xiaomi Special Scholarship (Top 10 university-wide)
  • 2024 Outstanding Graduate (Class of 2024)
  • 2023 National Scholarship for Outstanding Students (Top 5)

📝 Notes & Experiences

Study Abroad Experience

📅 Schedule a Meeting

If you’d like to discuss research collaboration or have any questions, feel free to schedule a meeting with me.

If you feel our backgrounds align and you’d like to collaborate, get help, or seek mentorship, please fill out this short form: Collaboration Interest Form