# Lex - Advanced Lexical Analyzer

A powerful lexical analyzer (lexer) written in C++ with support for multiple programming languages and advanced features. This tool tokenizes source code into meaningful lexical elements such as keywords, identifiers, operators, and literals.

## Features

- **Multiple Language Support**: Predefined configurations for C, C++, Java, Python, and JavaScript
- **Enhanced Plugin System**:
  - Load and register custom language definitions from JSON files
  - Extend the analyzer with new languages without recompiling
  - Export and share language configurations
  - Prioritized plugin loading for improved language handling
- **Advanced Token Recognition**:
  - Regular-expression-based pattern matching
  - Complete number formats: decimal, hex, octal, binary, scientific notation
  - Support for multi-line and documentation comments
  - Unicode support
  - Preprocessor directives
- **Symbol Table Construction**: Tracks identifiers and their scopes
- **Export Capabilities**:
  - Export tokens to JSON, XML, CSV, and HTML formats
  - Visual token stream representation in HTML
  - Export language configurations to JSON
- **Detailed Error Reporting**: Precise error location and context information
- **Performance Metrics**: Track lexing performance statistics
## Requirements

- C++ compiler with C++17 support (GCC 7+, Clang 5+, or MSVC 2017+)
- Make build system (optional, but recommended)

## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/adhamafis/lex.git
   cd lex
   ```

2. Build using Make:

   ```bash
   make
   ```

3. Verify the installation:

   ```bash
   ./lex --help
   ```
## Project Structure

```text
lex/
├── src/                     # Source code
│   ├── main.cpp             # Entry point
│   ├── Lexer.h/cpp          # Core lexer implementation
│   ├── Token.h/cpp          # Token definitions
│   ├── SymbolTable.h/cpp    # Symbol table implementation
│   ├── LanguageConfig.h/cpp # Language configurations
│   ├── ConfigLoader.h/cpp   # JSON configuration loading
│   ├── LanguagePlugin.h/cpp # Plugin system
│   ├── wasm_bindings.cpp    # WebAssembly bindings
│   └── ExportFormatter.h/cpp # Output formatting
├── include/                 # External libraries
│   └── json.hpp             # nlohmann/json library
├── plugins/                 # Language plugin definitions
│   ├── c_config.json        # C language definition
│   ├── cpp_config.json      # C++ language definition
│   ├── java_config.json     # Java language definition
│   ├── js_config.json       # JavaScript language definition
│   └── python_config.json   # Python language definition
├── web/                     # Web interface files
│   ├── index.html           # Web UI
│   ├── app.js               # Application logic
│   └── lex.js/wasm/data     # WebAssembly build outputs
├── tests/                   # Test files
├── .github/                 # GitHub configuration
│   └── workflows/           # GitHub Actions workflows
├── Makefile                 # Build configuration
├── emscripten.mk            # WebAssembly build configuration
├── .gitignore               # Git ignore file
├── README.md                # This file
└── LICENSE                  # MIT License
```
## Architecture

The lexer is designed with the following components:

- **Lexer**: The core component that reads source code and produces tokens
- **TokenStream**: A sequence of tokens that can be traversed
- **SymbolTable**: Maintains a database of identifiers and their scopes
- **LanguageConfig**: Defines language-specific rules and patterns
- **ConfigLoader**: Loads and saves language configurations from/to JSON
- **LanguagePluginManager**: Manages custom language plugins
- **ExportFormatter**: Formats token output in various formats
## Usage

```text
Usage: lex [options] [file]

Options:
  -i, --interactive             Start in interactive mode
  -l, --language <lang>         Specify language (c, cpp, java, python, js)
  -c, --config <file>           Use custom language configuration file
  -p, --plugins-dir <dir>       Specify plugins directory (default: ./plugins)
  -v, --verbose                 Show detailed token information
  -e, --export <format>         Export tokens in format (json, xml, csv, html)
  -o, --output <file>           Output file for export
  --export-config <lang> <file> Export language config to a JSON file
  --list-plugins                List available language plugins
  -h, --help                    Display this help message
```
### Interactive Mode

Run the lexer in interactive mode:

```bash
./lex -i
```

or with a specific language:

```bash
./lex -i -l python
```

In this mode, you can type expressions and see the resulting tokens. Type `exit` to quit or `language <n>` to change the active language.

### File Analysis

To analyze a file:

```bash
./lex path/to/file
```

With language specification:

```bash
./lex -l java path/to/file.java
```

With verbose output:

```bash
./lex -v path/to/file
```
### Exporting Tokens

Export tokens to various formats:

```bash
./lex -l cpp -e json -o tokens.json path/to/file.cpp
```

Available export formats:

- `json`: Structured JSON format
- `xml`: XML document
- `csv`: Comma-separated values
- `html`: Interactive HTML visualization
### Working with Plugins

List available language plugins:

```bash
./lex --list-plugins
```

Export a language configuration to a JSON file:

```bash
./lex --export-config js my_javascript.json
```

Use a custom configuration file:

```bash
./lex -c my_config.json path/to/file
```

Specify a custom plugins directory:

```bash
./lex -p /path/to/plugins -l js file.js
```
## Supported Languages

Lex comes with built-in support for:

| Language   | Flag        | File Extensions       |
|------------|-------------|-----------------------|
| C          | `-l c`      | `.c`, `.h`            |
| C++        | `-l cpp`    | `.cpp`, `.hpp`, `.cc` |
| Java       | `-l java`   | `.java`               |
| Python     | `-l python` | `.py`                 |
| JavaScript | `-l js`     | `.js`                 |
## Plugin System

Lex features a flexible and enhanced plugin system that allows loading and prioritizing language definitions from JSON files without recompiling the application.

### Installing Plugins

Place JSON language definition files in the `plugins/` directory to make them automatically available:

```bash
cp my_language.json plugins/
./lex --list-plugins   # Should show your new language
./lex -l my_language file.txt
```

The plugin system now prioritizes custom plugins over built-in language configurations, allowing for more flexible customization.
### Creating a Language Definition

Create a JSON file with the following structure:

```json
{
  "name": "MyLanguage",
  "version": "1.0",
  "keywords": ["if", "else", "while", "for"],
  "types": ["int", "float", "bool"],
  "characterSets": {
    "identifierStart": "_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ",
    "identifierContinue": "_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789",
    "operators": "+-*/=<>!&|^~?:",
    "delimiters": "()[]{},;.",
    "whitespace": " \\t\\n\\r\\f\\v"
  },
  "commentConfig": {
    "singleLineCommentStarts": ["//"],
    "multiLineCommentDelimiters": [
      { "start": "/*", "end": "*/" }
    ]
  },
  "stringConfig": {
    "stringDelimiters": [
      { "start": "\"", "end": "\"" }
    ],
    "escapeChar": "\\"
  },
  "numberConfig": {
    "decimalIntPattern": "\\b[0-9]+\\b",
    "floatingPointPattern": "\\b[0-9]+\\.[0-9]*|[0-9]*\\.[0-9]+\\b",
    "hexPattern": "\\b0[xX][0-9a-fA-F]+\\b",
    "octalPattern": "\\b0[oO][0-7]+\\b",
    "binaryPattern": "\\b0[bB][01]+\\b",
    "scientificPattern": "\\b[0-9]+(\\.[0-9]+)?[eE][+-]?[0-9]+\\b"
  },
  "tokenRules": [
    {
      "name": "Identifier",
      "pattern": "[a-zA-Z_][a-zA-Z0-9_]*",
      "type": 1,
      "precedence": 1
    },
    // Additional token rules...
  ]
}
```

## Example Output

### Interactive Session

```text
$ ./lex -i -l cpp
Lex Interactive Mode (Language: C++)
Enter 'exit' to quit, 'language <n>' to change language
> int main() { return 0; }

Lexical Analysis Results:
------------------------
Total tokens: 9
Processing time: 1 ms
No lexical errors detected.

Tokens:
  KEYWORD: 'int' at :1:1
  IDENTIFIER: 'main' at :1:5 [undeclared]
  PARENTHESIS: '(' at :1:9
  PARENTHESIS: ')' at :1:10
  BRACE: '{' at :1:12
  KEYWORD: 'return' at :1:14
  INTEGER: '0' at :1:21 [decimal]
  SEMICOLON: ';' at :1:22
  BRACE: '}' at :1:24
  EOF: '' at :1:26

Symbol Table:
Symbol Table: 1 unique symbols
  Scope: global (Global) at :0:0 [1 symbols]
```
### Verbose File Analysis

```text
$ ./lex -v path/to/file.cpp
Processing file: path/to/file.cpp (Language: C++)

Lexical Analysis Results:
------------------------
Total tokens: 145
Processing time: 12 ms
No lexical errors detected.

Tokens:
[... token listing ...]
```
### HTML Export

```text
$ ./lex -l cpp -e html -o tokens.html path/to/file.cpp
Tokens exported to tokens.html in html format.
```

The HTML output provides a visual representation of the token stream with syntax highlighting.

### Plugin Workflow

```text
# List available plugins
$ ./lex --list-plugins
Available language plugins:
  - javascript
  - mylanguage

# Use a plugin for analysis
$ ./lex -l mylanguage source.myext
Processing file: source.myext (Language: MyLanguage)
...

# Export a built-in language as a starting point for customization
$ ./lex --export-config cpp custom_cpp.json
Language configuration for C++ exported to: custom_cpp.json
```
## Using the Lexer in Your Code

The lexer can be extended with custom language configurations:

```cpp
// In your code
LanguageConfig customConfig;
customConfig.setName("MyLanguage");
customConfig.addKeywords({"if", "else", "while", "for"});
// ... more configuration

Lexer lexer(source, customConfig);
```

Tokens can also be produced and processed directly:

```cpp
// In your code
Lexer lexer(source);
std::vector<Token> tokens = lexer.tokenize();

// Process tokens manually
for (const auto& token : tokens) {
    // Do something with each token
}
```

## Troubleshooting
1. **Compilation errors**
   - Make sure you have a C++17 compatible compiler
   - Check that all dependencies are installed

2. **Language detection issues**
   - Explicitly specify the language with the `-l` flag
   - Check that your file extension is recognized

3. **Export failures**
   - Ensure the output directory exists and is writable
   - Check that the specified format is supported

4. **Plugin loading issues**
   - Verify that the JSON syntax in your plugin file is correct
   - Ensure regex patterns are properly escaped
   - Use the `--list-plugins` option to check if your plugin is detected

For detailed debug output, compile with:

```bash
make debug
```
## Extending the Lexer

The lexer can be easily extended with:

1. **New Language Support**:
   - Add new language configurations in `LanguageConfig.cpp`
   - Or create JSON plugin files in the `plugins/` directory

2. **Additional Token Types**: Extend the `TokenType` enum in `Token.h`

3. **Custom Export Formats**: Create new exporters in `ExportFormatter.cpp`

4. **Adding New Features**:
   - Implement additional symbol table functionality
   - Add more advanced error recovery mechanisms
   - Extend token attribute handling
## Third-Party Libraries

This project uses the following third-party libraries:

- **nlohmann/json**: A modern C++ JSON library
  - Version: 3.11.3
  - License: MIT
  - Used for parsing and generating JSON configuration files
## Web Version

Lex is also available as a WebAssembly module that can run directly in your browser. You can try it online at:

https://adhamafis.github.io/lex/

The web version includes all the features of the command-line version, including:

- Support for all built-in language configurations
- Real-time tokenization and analysis
- HTML visualization of the token stream
- Language plugin system
### Building for the Web

To compile Lex to WebAssembly for web browsers:

1. Install the Emscripten compiler toolkit:

   ```bash
   # Using a package manager
   brew install emscripten   # macOS

   # Or manually
   git clone https://github.com/emscripten-core/emsdk.git
   cd emsdk
   ./emsdk install latest
   ./emsdk activate latest
   source ./emsdk_env.sh
   ```

2. Build using the Emscripten makefile:

   ```bash
   make -f emscripten.mk
   ```

3. The compiled files will be placed in the `web/` directory:
   - `lex.js` - JavaScript glue code
   - `lex.wasm` - WebAssembly binary
   - `lex.data` - Preloaded plugin data
   - Various support files for the module

4. Open `web/index.html` in a browser to test locally:

   ```bash
   cd web && python -m http.server
   ```

5. Visit http://localhost:8000 in your browser
### Deployment

The web version is automatically deployed to GitHub Pages whenever changes are pushed to the main branch using GitHub Actions. The deployment workflow:

1. Checks out the repository
2. Sets up Emscripten
3. Builds the WebAssembly module
4. Copies all plugin configurations to the web directory
5. Deploys the contents of the web directory to the gh-pages branch

This ensures that the latest version of Lex is always available online.

The web version will automatically load language plugins from the `plugins/` directory. You can add your own language definitions there to make them available in the web interface. The enhanced plugin system provides:

- Automatic plugin detection and registration
- Support for custom language configurations
- Improved error logging for plugin loading issues
- Prioritization of plugins over built-in languages
## Contributing

Contributions are welcome! To contribute:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

Please ensure your code follows the project's coding style and includes appropriate tests.

## License

This project is open source and available under the MIT License.