Selected Publications
For a complete list of publications, please visit my Google Scholar profile.
Research Interest 1: Uncovering NLP & LLM Internal Mechanisms and Interpretability
ZeroTuning: Unlocking the Initial Token's Power to Enhance Large Language Models Without Training
Feijiang Han, Xiaodong Yu, Jianheng Tang, Delip Rao, Weihua Du, Lyle Ungar
Paper | Code & Demo | Blog | Poster | ICLR Talks
TL;DR. Training-free attention tuning can boost frozen LLMs, but prior methods often depend on fragile heuristics to find "important" task tokens. ZeroTuning shows a simpler universal control lever: tune only the initial token (e.g., <BOS>). With tiny head-specific biases on BOS attention logits, we can reshape downstream attention (sharpen/flatten), lower output entropy, and unlock pretrained knowledge, all without any parameter updates.
Key Points:
- Lightweight + practical: ~4-line change, no KV-cache / decoding changes, kernel-agnostic (works with SDPA & FlashAttention), and works with quantized inference.
- Strong, broad gains: across 15 datasets, e.g., on Llama-3.1-8B: +19.9% (classification), +4.5% (QA), +2.1% (dialogue); MT-Bench improves 7.804 → 7.966.
- Why it works (mechanism insights): BOS acts as an attention sink, so tuning it gives monotonic control of attention entropy; effects are stronger in earlier layers and heterogeneous across heads (up-effective vs. down-effective). Includes supervised calibration and an unsupervised entropy-minimization variant.
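As a minimal sketch of the core mechanism, assuming a simplified single-layer attention in NumPy (the function name, tensor shapes, and bias placement are illustrative, not the paper's actual implementation): add a per-head scalar bias to the attention logits pointing at the initial token before the softmax, with all model weights frozen.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def bos_tuned_attention(q, k, v, bos_bias):
    """Sketch of the ZeroTuning idea: bias only the logits attending TO the
    initial token (index 0) before softmax; no parameters are updated.
    q, k, v: (heads, seq, dim); bos_bias: (heads,) per-head scalars."""
    d = q.shape[-1]
    logits = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)  # (H, S, S)
    logits[:, :, 0] += bos_bias[:, None]              # tune only the <BOS> column
    attn = softmax(logits)                            # each row sums to 1
    return attn @ v, attn
```

Raising the bias drains attention mass into the sink token, while lowering it redistributes mass onto content tokens, which is one way to read the monotonic entropy control described above.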
Read Before You Think: Mitigating LLM Comprehension Failures with Step-by-Step Reading
Feijiang Han, Hengtao Cui, Licheng Guo, Zelong Wang, Zhiyuan Lyu
TL;DR. Many "reasoning" failures in LLMs are actually comprehension failures: the model misreads the question (semantic misunderstanding), so even Chain-of-Thought can't reliably help. We introduce Step-by-Step Reading (SSR), a training-free framework that makes models read before they think: parse the question incrementally, keep each reasoning step grounded to the text, and fix backward dependencies via iterative re-contextualization.
Key Points:
- Identified Semantic Misunderstanding as a core reasoning bottleneck that persists even with CoT, stemming from the inherent constraints of the unidirectional attention mechanism.
- Explained the effectiveness of prompt repetition through the lens of Attention: it helps models suppress focus on low-semantic tokens (e.g., punctuation) and redistribute attention to critical information.
- Proposed a training-free framework to resolve these issues by: (1) applying step-by-step reading logic, (2) automatically steering attention to key tokens via self-reference, and (3) resolving backward dependencies through iterative re-contextualization.
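As a rough illustration of the training-free framework, an SSR-style prompt could be assembled as below; the template wording is hypothetical, and the paper's actual prompts may differ.

```python
def build_ssr_prompt(question: str) -> str:
    """Hypothetical sketch of a step-by-step-reading prompt: parse the
    question clause by clause, ground each step in the exact text, and
    re-read earlier clauses when later ones depend on them."""
    steps = [
        "1. Read the question one clause at a time and restate each clause.",
        "2. For every reasoning step, quote the exact words it relies on.",
        "3. If a later clause changes the meaning of an earlier one, "
        "re-read and restate the earlier clause before continuing.",
    ]
    return ("Question:\n" + question
            + "\n\nInstructions:\n" + "\n".join(steps) + "\nAnswer:")
```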
Research Interest 2: Model Adaptation
Feijiang Han, Jiaming Zhang, Chuyi Deng, Jianheng Tang, Yunhuai Liu
TL;DR. WebShell detection is hard for LLMs because a server-side script can span millions of tokens while the truly malicious logic is often just a tiny, obfuscated fragment, so naïvely feeding the whole file dilutes the signal and breaks context limits. We provide the first comprehensive evaluation of LLMs for WebShell detection and introduce BFAD, a behavior-driven, function-aware pipeline that helps LLMs focus on the most indicative code, yielding a +13.82% average F1 improvement and pushing both large and small LLMs toward (or beyond) prior SOTA.
Task: input = {server-side script / PHP file} → output = {WebShell / Benign}.
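A toy sketch of the function-aware idea: scan a long script for behavior-indicative calls and keep only their surrounding lines for the LLM prompt. The risky-call list and window size below are illustrative assumptions, not BFAD's actual behavioral profiling.

```python
import re

# Illustrative behavior-indicative PHP calls; BFAD's real selection is
# behavior-driven and more sophisticated than a keyword list.
RISKY_CALLS = ["eval", "assert", "system", "exec", "shell_exec",
               "base64_decode", "create_function", "preg_replace"]

def extract_indicative_snippets(script: str, window: int = 2) -> list[str]:
    """Return small line windows around risky calls so the prompt fits the
    context limit instead of containing the whole file."""
    lines = script.splitlines()
    snippets = []
    for i, line in enumerate(lines):
        if any(re.search(rf"\b{re.escape(c)}\s*\(", line) for c in RISKY_CALLS):
            lo, hi = max(0, i - window), min(len(lines), i + window + 1)
            snippets.append("\n".join(lines[lo:hi]))
    return snippets
```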
LaTeX2Layout: High-Fidelity, Scalable Document Layout Annotation Pipeline for Layout Detection
Feijiang Han, Zelong Wang, Bowen Wang, Xinxin Liu, Skyler Cheung, Delip Rao, Chris Callison-Burch, Lyle Ungar
Paper | [Code & Dataset] (Release Due: 2026.3.1)
TL;DR. Layout detection turns a PDF into structured page understanding (bounding boxes + reading order), but current VLMs struggle mainly because high-fidelity supervision is scarce and PDF-parser-based labels are noisy and expensive. We introduce LaTeX2Layout, a scalable data-centric pipeline that extracts pixel-accurate layout ground truth directly from the LaTeX compilation process, enabling large-scale training without manual annotation.
Task: input = {PDF document} → output = {page elements' bounding boxes + reading order (optionally OCR)}.
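For illustration only: once per-element boxes have been recorded at compile time, they can be serialized into a standard detection format. The sketch below emits minimal COCO-style records; the field names and element schema are assumptions, not the released dataset's actual format.

```python
def to_coco(page_id: int, elements: list[dict], categories: dict[str, int]) -> list[dict]:
    """Convert pixel-accurate element boxes (bbox = x0, y0, x1, y1, plus a
    label and compile-time order) into COCO-style annotations, preserving
    reading order."""
    anns = []
    for reading_order, el in enumerate(sorted(elements, key=lambda e: e["order"])):
        x0, y0, x1, y1 = el["bbox"]
        anns.append({
            "image_id": page_id,
            "category_id": categories[el["label"]],
            "bbox": [x0, y0, x1 - x0, y1 - y0],  # COCO uses x, y, width, height
            "reading_order": reading_order,
        })
    return anns
```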
Feijiang Han
Paper | Video (AI) | Slide (AI) | [Code & Dataset] (Release Due: 2026.3.1)
TL;DR. While WebShell detection answers "malicious or not," real-world defense also needs attribution and tracking: WebShells come in diverse families with different behaviors and variants. We are the first to systematically study representation learning for automated WebShell family classification.
Task: given a WebShell ā predict its family ID.
Key Points:
- Benchmark: the first systematic study of representation learning for fine-grained WebShell family classification.
- Behavioral view: dynamic function-call traces + LLM-augmented variants for robust behavioral analysis.
- Key finding: structural representations (especially tree-based GNNs / Tree-GAT) consistently outperform sequence models for family attribution.
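A minimal sketch of the behavioral view above: turn a dynamic function-call trace into the parent/child edge list of a call tree, the structure a tree-based GNN would consume. The (function, depth) trace format is a hypothetical simplification of a runtime hook's output.

```python
def trace_to_edges(trace: list[tuple[str, int]]) -> list[tuple[str, str]]:
    """Build parent -> child call edges from a trace of (function, depth)
    events. A stack tracks the current caller at each nesting depth."""
    edges, stack = [], []
    for func, depth in trace:
        del stack[depth:]          # unwind the stack to this call's depth
        if stack:
            edges.append((stack[-1], func))
        stack.append(func)
    return edges
```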
ThinknCheck: Grounded Claim Verification with Compact, Reasoning-Driven, and Interpretable Models
Delip Rao, Feijiang Han, Chris Callison-Burch
Poster | [Paper] (Coming Soon)
TL;DR. Efficient scientific claim verification is essential for trustworthy literature review and retrieval, but most strong verifiers are large, expensive, and hard to interpret. We develop ThinknCheck, a compact "reason first, then decide" verifier, and summarize best practices for making small LLMs reliable and interpretable on document-grounded claim verification.
Task: input = {Document, Claim} ā output = {True / False}.
Key Points:
- 1B-scale, 4-bit ThinknCheck verifier trained to "reason first, then decide" for scientific claim verification
- New reasoning-augmented datasets LLMAggreFact-Think and GSMClaims for document-grounded scientific and arithmetic claims
- Small model matches or surpasses larger specialized verifiers (e.g., MiniCheck-7B) while providing short, interpretable rationales
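The "reason first, then decide" interface can be sketched as a tiny parser over the model's generation: the rationale comes first and the verdict is read off a final marker line. The `Verdict:` marker and output format here are assumptions for illustration, not the model's actual schema.

```python
def parse_verdict(generation: str) -> tuple[str, bool]:
    """Split a reason-first generation into (rationale, verdict).
    Assumes the model ends with a line 'Verdict: True' or 'Verdict: False'."""
    rationale, _, last = generation.rpartition("Verdict:")
    verdict = last.strip().lower().startswith("true")
    return rationale.strip(), verdict
```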
Research Interest 3: Other Topics (HCI, Big Data Visualization, IoT, Federated and Continual Learning)
Credit and quality intelligent learning based multi-armed bandit scheme for unknown worker selection in multimedia MCS
Jianheng Tang, Feijiang Han, Kejia Fan, et al.
TL;DR. High-quality training data is the bottleneck for modern multimodal and foundation models, and mobile crowd sensing (MCS) is a scalable way to collect it, but platforms must recruit workers before knowing who is trustworthy or produces high-quality data. We formulate this as an online decision-making problem under uncertainty and propose CQL-MAB, a bandit-style RL scheme that learns workers' credit (honesty) and quality (data utility) from feedback and selects workers cost-effectively with incentive guarantees.
Why this is RL: it's a contextual multi-armed bandit: repeatedly choose "arms" (workers), observe stochastic rewards (credit/quality), and minimize regret while respecting budget/auction constraints.
Task: input = {workersā bids + streaming feedback from their submitted data} ā output = {selected worker set (and payments) each round}.
Key Points:
- CQL-MAB: jointly models credit (trustworthiness) and quality (data utility) as rewards for bandit-based recruitment under budget.
- Two-stage / two-level reward UCB to pick workers while continuously updating beliefs about unknown workers.
- Proven properties in reverse auctions: truthfulness, individual rationality, and computational efficiency, plus strong empirical performance (revenue/regret).
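A stripped-down sketch of the bandit core, assuming Bernoulli credit/quality feedback and a plain UCB1 rule; CQL-MAB's actual two-level reward and reverse-auction logic are more involved than this.

```python
import math
import random

def ucb_recruit(true_rates: list[float], rounds: int = 2000, seed: int = 0) -> list[int]:
    """UCB1 over workers: each round, pick the worker with the highest
    empirical mean plus exploration bonus, observe a Bernoulli reward
    (standing in for credit * quality), and update. Returns per-worker
    selection counts."""
    rng = random.Random(seed)
    n = len(true_rates)
    counts = [0] * n
    sums = [0.0] * n
    for t in range(1, rounds + 1):
        if t <= n:                      # initialization: play each arm once
            arm = t - 1
        else:
            arm = max(range(n), key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1.0 if rng.random() < true_rates[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts
```

Over time the exploration bonus shrinks for well-sampled workers, so the scheme converges to recruiting the genuinely reliable ones, which is the regret-minimization behavior the key points describe.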
- CALM: A Ubiquitous Crowdsourced Analytic Learning Mechanism for Continual Service Construction with Data Privacy Preservation (UbiComp 2025)
  Kejia Fan, Yuwei Huang, Jiayi He, Feijiang Han, Jianheng Tang, et al.
- APFL: Analytic Personalized Federated Learning via Dual-Stream Least Squares (arXiv 2025)
  Kejia Fan, Jianheng Tang, Zixuan Yang, Feijiang Han, Jiayi Li, et al.
- ACU: Analytic Continual Unlearning for Efficient and Exact Forgetting with Privacy Preservation (arXiv 2025)
  Jianheng Tang, Haotian Zhuang, Dongxiao Fang, Jiayi Li, Feijiang Han, et al.
- MAB-RP: A Multi-Armed Bandit based workers selection scheme for accurate data collection in crowdsensing (Information Sciences 2024)
  Yuwei Lou, Jianheng Tang, Feijiang Han, Anfeng Liu, et al.
- Fctree: Visualization of function calls in execution (Information and Software Technology 2024)
  Fei Zhou, Yifan Fan, Shengchao Lv, Lingxiao Jiang, Zhuo Chen, Jingui Yuan, Feijiang Han, et al.
- CRL-MABA: A completion rate learning-based accurate data collection scheme in large-scale energy internet (IEEE IoT Journal 2023)
  Kejia Fan, Jianheng Tang, Wenbin Xie, Feijiang Han, Yuwei Huang, et al.
- BTV-CMAB: A bi-directional trust verification-based combinatorial multiarmed bandit scheme for mobile crowdsourcing (IEEE IoT Journal 2023)
  Jianheng Tang, Kejia Fan, Wenbin Xie, Feijiang Han, et al.
- A Semi-supervised Sensing Rate Learning based CMAB scheme to combat COVID-19 by trustful data collection in the crowd (Computer Communications 2023)
  Jianheng Tang, Kejia Fan, Wenbin Xie, Lingxiao Zeng, Feijiang Han, et al.