Tokenizer is a BPE (Byte Pair Encoding) tokenizer designed for use with OpenAI large language models (LLMs). Developed by Microsoft, this tool provides implementations in both TypeScript and C#, enabling seamless tokenization across Node.js and .NET environments.
Key Features:
Cross-Language Support: Offers TypeScript and C# implementations to cater to a wide range of development ecosystems.
Compatibility with OpenAI Models: Based on the open-source Rust implementation from OpenAI's tiktoken, ensuring compatibility and consistency with OpenAI LLMs.
Migration Path for .NET Users: Provides guidance for transitioning from Microsoft.DeepDev.TokenizerLib to Microsoft.ML.Tokenizers, a more performant tokenizer library under active development by the .NET team.
Audience & Benefit:
Tokenizer is intended for developers working with OpenAI language models in TypeScript or C# environments. It lets users tokenize prompts before sending them to an LLM, which gives accurate token counts and helps control the cost of model usage.
The software can be installed via winget, making it easy to integrate into development pipelines.
README
Tokenizer
This repo contains TypeScript and C# implementations of a byte pair encoding (BPE) tokenizer for OpenAI LLMs. It is based on the open-source Rust implementation in OpenAI's tiktoken. Both implementations are useful for tokenizing prompts in Node.js and .NET environments before feeding them into an LLM.
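To illustrate what byte pair encoding does, here is a minimal, self-contained sketch of the merge loop. It is not this repo's implementation or API: the names `asciiBytes`, `bpeEncode`, and `toyRanks` are hypothetical, the vocabulary is a made-up eight-entry table, and the regex pre-splitting step that real tokenizers apply before merging is omitted. As in tiktoken-style vocabularies, each entry's rank is assumed to double as its token id.

```typescript
// Toy BPE sketch (assumption: illustrative only, not the library's code).
type Ranks = Map<string, number>;

// Key for a byte sequence, e.g. [108, 111] -> "108,111".
const keyOf = (bytes: number[]): string => bytes.join(",");

// ASCII-only helper so that char codes equal UTF-8 bytes.
function asciiBytes(text: string): number[] {
  return Array.from(text, (ch) => {
    const code = ch.charCodeAt(0);
    if (code > 127) throw new Error("sketch supports ASCII input only");
    return code;
  });
}

function bpeEncode(bytes: number[], ranks: Ranks): number[] {
  // Start with one part per byte.
  const parts: number[][] = bytes.map((b) => [b]);

  while (parts.length > 1) {
    // Find the adjacent pair whose merged form has the lowest rank.
    let bestIndex = -1;
    let bestRank = Infinity;
    for (let i = 0; i < parts.length - 1; i++) {
      const rank = ranks.get(keyOf([...parts[i], ...parts[i + 1]]));
      if (rank !== undefined && rank < bestRank) {
        bestRank = rank;
        bestIndex = i;
      }
    }
    if (bestIndex === -1) break; // no mergeable pair left
    parts.splice(bestIndex, 2, [...parts[bestIndex], ...parts[bestIndex + 1]]);
  }

  // Map each remaining part to its token id.
  return parts.map((p) => {
    const id = ranks.get(keyOf(p));
    if (id === undefined) throw new Error(`no token for bytes ${keyOf(p)}`);
    return id;
  });
}

// Hypothetical vocabulary: the array index is both rank and token id.
const toyRanks: Ranks = new Map(
  ["l", "o", "w", "e", "r", "lo", "low", "er"].map(
    (s, i): [string, number] => [keyOf(asciiBytes(s)), i],
  ),
);

console.log(bpeEncode(asciiBytes("lower"), toyRanks)); // [ 6, 7 ]
```

With this table, "lower" starts as five single-byte parts, merges `l`+`o`, then `lo`+`w`, then `e`+`r`, and ends as the two tokens `low` and `er`. The token count (here 2) is the figure that matters when checking a prompt against a model's context limit.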
> [!IMPORTANT]
> Users of Microsoft.DeepDev.TokenizerLib should migrate to Microsoft.ML.Tokenizers. The functionality in Microsoft.DeepDev.TokenizerLib has been added to Microsoft.ML.Tokenizers. Microsoft.ML.Tokenizers is a tokenizer library being developed by the .NET team and, going forward, is the central place for tokenizer development in .NET. By using Microsoft.ML.Tokenizers, you should see improved performance over existing tokenizer library implementations, including Microsoft.DeepDev.TokenizerLib. A stable release of Microsoft.ML.Tokenizers is expected alongside the .NET 9.0 release (November 2024). Instructions for migration can be found at https://github.com/dotnet/machinelearning/blob/main/docs/code/microsoft-ml-tokenizers-migration-guide.md.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.