Research & Publications
Our team has nearly 50 academic publications with 4,000+ citations. We publish in top venues and release open-source tools used across the legal, AI, and software industries.
Books
The Math Inside the Machine: How Intelligence Emerges from Eleven Simple Operations
Bommarito, M. J. 2026.
Explains the mathematical foundations of AI and machine learning through eleven core operations.
Agentic AI in Law and Finance: Navigating a New Era of Autonomous Systems
Bommarito, M. J. 2026.
Comprehensive guide to designing, deploying, and governing agentic AI in regulated industries.
Legal Informatics
Katz, D. M., Dolin, R., Bommarito, M. J. Cambridge University Press, 2021.
Foundational textbook on computational approaches to law.
Selected publications
Moratorium Nation: A Survey of Data Center, Renewable Energy, and Battery Storage Moratoria in the United States
Bommarito, M. J. SSRN, 2026.
Comprehensive survey of moratoria affecting data center, renewable energy, and battery storage development across the United States.
How to Design an AI Agent: Architectures, Protocols, and Technical Evaluation of Agentic AI Systems for Law & Finance
Bommarito, M. J., Katz, D. M., Bommarito, J. SSRN, 2025.
Architectures, protocols, and technical evaluation frameworks for agentic AI systems in regulated industries.
Governing AI Agents: Risk, Compliance, and Accountability in Law and Finance
Bommarito, J., Katz, D. M., Bommarito, M. J. SSRN, 2025.
Governance frameworks for autonomous AI agents operating in legal and financial contexts.
The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models
Bommarito II, M. J., Bommarito, J., Katz, D. M. arXiv, 2025.
132M+ copyright-clean documents, 1.35 trillion tokens from 16 sources. The data pipeline behind the first Fairly Trained LLM.
KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers
Bommarito, M. J., Katz, D. M., Bommarito, J. arXiv, 2025.
Custom tokenizers for legal, financial, and preprocessing applications.
Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary
Bommarito, M. J., Katz, D. M., Bommarito, J. arXiv, 2025.
High-precision sentence boundary detection for legal document processing at scale.
What is an Agent? A Conceptual Primer and History of Agents and Agentic AI
Bommarito, M. J., Bommarito, J., Katz, D. M. SSRN, 2025.
Conceptual foundation and history of AI agents, from early expert systems to modern agentic architectures.
Binary-30K: A Heterogeneous Dataset for Deep Learning in Binary Analysis and Malware Detection
Bommarito, M. J. arXiv, 2025.
Large-scale dataset for training deep learning models on binary analysis and malware detection tasks.
Binary BPE: A Family of Cross-Platform Tokenizers for Binary Analysis
Bommarito, M. J. arXiv, 2025.
Cross-platform tokenizers designed for binary code analysis, enabling ML-based security research.
OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph
Bommarito, M. J. arXiv, 2025.
Synthetic encyclopedic dictionary and knowledge graph for NLP and semantic analysis.
GPT-4 Passes the Bar Exam
Katz, D. M., Bommarito, M. J., Gao, S., Arredondo, P. Philosophical Transactions of the Royal Society A, 2024.
Demonstrates GPT-4 passing the Uniform Bar Examination. Featured by CNN, Bloomberg, and the ABA Journal; cited by OpenAI.
GPT as Knowledge Worker: A Zero-Shot Evaluation of (AI)CPA Capabilities
Bommarito, J., Bommarito, M., Katz, D. M., Katz, J. arXiv, 2023.
Evaluates LLM performance on CPA examination tasks. Featured in AICPA Journal of Accountancy.
Natural Language Processing in the Legal Domain
Katz, D. M., Hartung, D., Gerlach, L., Jana, A., Bommarito II, M. J. arXiv, 2023.
Comprehensive survey of NLP methods and applications in legal text processing.
GPT Takes the Bar Exam
Bommarito II, M., Katz, D. M. arXiv, 2022.
Initial study testing GPT on the Uniform Bar Examination, preceding the GPT-4 follow-up published in Philosophical Transactions of the Royal Society A.
LexGLUE: A Benchmark Dataset for Legal Language Understanding in English
Chalkidis, I., Jana, A., Hartung, D., Bommarito, M., Androutsopoulos, I., Katz, D. M., Aletras, N. ACL, 2022.
Benchmark dataset for evaluating NLP models on legal text. 422+ citations.
LexNLP: Natural Language Processing and Information Extraction for Legal and Regulatory Texts
Bommarito, M. J., Katz, D. M., Detterman, E. M. Research Handbook on Big Data Law, Edward Elgar Publishing, 2021.
Open-source NLP library for legal text extraction and analysis. 102+ citations.
An Empirical Analysis of the Python Package Index (PyPI)
Bommarito, E., Bommarito, M. arXiv, 2019.
Empirical analysis of 178,592 packages and 1.7M releases in the Python ecosystem. Informs dependency risk analysis in technology due diligence.
Harnessing Legal Complexity
Ruhl, J. B., Katz, D. M., Bommarito, M. J. Science, 2017.
Models legal systems as complex adaptive systems.
A General Approach for Predicting the Behavior of the Supreme Court of the United States
Katz, D. M., Bommarito II, M. J., Blackman, J. PLoS ONE, 2017.
Machine learning model predicting Supreme Court decisions with 70%+ accuracy.
Open-source projects
KL3M
Copyright-clean language model family (132M+ documents, 1.35T tokens)
LexNLP
Natural language processing library for legal text (102+ citations)
OpenEDGAR
Open-source SEC EDGAR data pipeline and analysis tools
KL3M Tokenizers
Domain-specific and character-level tokenizers for legal, financial, and preprocessing applications