Research & Publications

Our team has nearly 50 academic publications with 4,000+ citations. We publish in top venues and release open-source tools used across the legal, AI, and software industries.

Books

The Math Inside the Machine: How Intelligence Emerges from Eleven Simple Operations

Bommarito, M. J. 2026.

Explains the mathematical foundations of AI and machine learning through eleven core operations.

Agentic AI in Law and Finance: Navigating a New Era of Autonomous Systems

Bommarito, M. J. 2026.

Comprehensive guide to designing, deploying, and governing agentic AI in regulated industries.

Legal Informatics

Katz, D. M., Dolin, R., Bommarito, M. J. Cambridge University Press, 2021.

Foundational textbook on computational approaches to law.

Selected publications

Moratorium Nation: A Survey of Data Center, Renewable Energy, and Battery Storage Moratoria in the United States

Bommarito, M. J. SSRN, 2026.

Comprehensive survey of moratoria affecting data center, renewable energy, and battery storage development across the United States.

How to Design an AI Agent: Architectures, Protocols, and Technical Evaluation of Agentic AI Systems for Law & Finance

Bommarito, M. J., Katz, D. M., Bommarito, J. SSRN, 2025.

Architectures, protocols, and technical evaluation frameworks for agentic AI systems in regulated industries.

Governing AI Agents: Risk, Compliance, and Accountability in Law and Finance

Bommarito, J., Katz, D. M., Bommarito, M. J. SSRN, 2025.

Governance frameworks for autonomous AI agents operating in legal and financial contexts.

The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models

Bommarito II, M. J., Bommarito, J., Katz, D. M. arXiv, 2025.

132M+ copyright-clean documents and 1.35 trillion tokens drawn from 16 sources. The data pipeline behind the first Fairly Trained LLM.

KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers

Bommarito, M. J., Katz, D. M., Bommarito, J. arXiv, 2025.

Custom tokenizers for legal, financial, and preprocessing applications.

Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary

Bommarito, M. J., Katz, D. M., Bommarito, J. arXiv, 2025.

High-precision sentence boundary detection for legal document processing at scale.

What is an Agent? A Conceptual Primer and History of Agents and Agentic AI

Bommarito, M. J., Bommarito, J., Katz, D. M. SSRN, 2025.

Conceptual foundation and history of AI agents, from early expert systems to modern agentic architectures.

Binary-30K: A Heterogeneous Dataset for Deep Learning in Binary Analysis and Malware Detection

Bommarito, M. J. arXiv, 2025.

Large-scale dataset for training deep learning models on binary analysis and malware detection tasks.

Binary BPE: A Family of Cross-Platform Tokenizers for Binary Analysis

Bommarito, M. J. arXiv, 2025.

Cross-platform tokenizers designed for binary code analysis, enabling ML-based security research.

OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph

Bommarito, M. J. arXiv, 2025.

Synthetic encyclopedic dictionary and knowledge graph for NLP and semantic analysis.

GPT-4 Passes the Bar Exam

Katz, D. M., Bommarito, M. J., Gao, S., Arredondo, P. Philosophical Transactions of the Royal Society A, 2024.

Demonstrates GPT-4 passing the Uniform Bar Examination. Featured by CNN, Bloomberg, and ABA Journal, and cited by OpenAI.

GPT as Knowledge Worker: A Zero-Shot Evaluation of (AI)CPA Capabilities

Bommarito, J., Bommarito, M., Katz, D. M., Katz, J. arXiv, 2023.

Evaluates LLM performance on CPA examination tasks. Featured in AICPA Journal of Accountancy.

Natural Language Processing in the Legal Domain

Katz, D. M., Hartung, D., Gerlach, L., Jana, A., Bommarito II, M. J. arXiv, 2023.

Comprehensive survey of NLP methods and applications in legal text processing.

GPT Takes the Bar Exam

Bommarito II, M., Katz, D. M. arXiv, 2022.

Initial study testing GPT on the Uniform Bar Examination, preceding the GPT-4 follow-up published in Philosophical Transactions of the Royal Society A.

LexGLUE: A Benchmark Dataset for Legal Language Understanding in English

Chalkidis, I., Jana, A., Hartung, D., Bommarito, M., Androutsopoulos, I., Katz, D. M., Aletras, N. ACL, 2022.

Benchmark dataset for evaluating NLP models on legal text. 422+ citations.

LexNLP: Natural Language Processing and Information Extraction for Legal and Regulatory Texts

Bommarito, M. J., Katz, D. M., Detterman, E. M. Research Handbook on Big Data Law, Edward Elgar Publishing, 2021.

Open-source NLP library for legal text extraction and analysis. 102+ citations.

An Empirical Analysis of the Python Package Index (PyPI)

Bommarito, E., Bommarito, M. arXiv, 2019.

Empirical analysis of 178,592 packages and 1.7M releases in the Python ecosystem. Informs dependency risk analysis in technology due diligence.

Harnessing Legal Complexity

Ruhl, J. B., Katz, D. M., Bommarito, M. J. Science, 2017.

Models legal systems as complex adaptive systems. Published in Science.

A General Approach for Predicting the Behavior of the Supreme Court of the United States

Katz, D. M., Bommarito II, M. J., Blackman, J. PLoS ONE, 2017.

Machine learning model predicting Supreme Court decisions with 70%+ accuracy.

Open-source projects

KL3M

Copyright-clean language model family (132M+ documents, 1.35T tokens)

LexNLP

Natural language processing library for legal text (102+ citations)

OpenEDGAR

Open-source SEC EDGAR data pipeline and analysis tools

KL3M Tokenizers

Domain-specific and character-level tokenizers for legal, financial, and preprocessing applications
