ONE - On-device Neural Engine
ggma_tokenize.h File Reference
#include "ggma_types.h"


Typedefs

typedef struct ggma_tokenizer ggma_tokenizer
 Opaque handle to a GGMA tokenizer.
 

Functions

GGMA_STATUS ggma_create_tokenizer (ggma_tokenizer **tokenizer, const char *tokenizer_path)
 Creates a GGMA tokenizer from a specified tokenizer path.
 
GGMA_STATUS ggma_free_tokenizer (ggma_tokenizer *tokenizer)
 Frees all resources associated with a GGMA tokenizer.
 
GGMA_STATUS ggma_tokenize (const ggma_tokenizer *tokenizer, const char *text, size_t text_len, int32_t *tokens, size_t n_tokens_max, size_t *n_tokens)
 Tokenizes an input text string into a sequence of token IDs.
 
GGMA_STATUS ggma_detokenize (const ggma_tokenizer *tokenizer, const int32_t *tokens, size_t n_tokens, char *text, size_t text_len)
 Detokenizes a sequence of token IDs back into a text string.
 

Typedef Documentation

◆ ggma_tokenizer

Opaque handle to a GGMA tokenizer.

A GGMA tokenizer encapsulates all necessary components for text tokenization, including the tokenizer model and vocabulary.

Definition at line 36 of file ggma_tokenize.h.
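
The opaque-handle idea described above can be sketched as follows. This is a minimal illustration, not the library's code: `demo_handle`, `DemoImpl`, `demo_create`, `demo_unwrap`, and `demo_free` are hypothetical names chosen for the example. The caller only ever sees a pointer to an incomplete type, while the implementation object lives behind it.

```cpp
#include <cassert>
#include <string>
#include <utility>

// C-visible side: an incomplete type. Callers can hold a pointer to it
// but cannot inspect or construct it. (Hypothetical name; it plays the
// role that ggma_tokenizer plays in the header.)
struct demo_handle;

// C++ side: the object hidden behind the handle (hypothetical class).
class DemoImpl
{
public:
  explicit DemoImpl(std::string dir) : _dir(std::move(dir)) {}
  const std::string &dir() const { return _dir; }

private:
  std::string _dir;
};

// Hand out the implementation pointer disguised as the opaque handle.
demo_handle *demo_create(const char *dir)
{
  return reinterpret_cast<demo_handle *>(new DemoImpl(dir));
}

// Recover the implementation pointer. Only valid for handles that came
// from demo_create().
const DemoImpl *demo_unwrap(const demo_handle *h)
{
  return reinterpret_cast<const DemoImpl *>(h);
}

// Destroy the implementation; the handle must not be used afterwards.
void demo_free(demo_handle *h) { delete reinterpret_cast<DemoImpl *>(h); }
```

Because the struct is never defined in the public header, the implementation class can change without breaking callers that were compiled against the header.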

Function Documentation

◆ ggma_create_tokenizer()

GGMA_STATUS ggma_create_tokenizer(ggma_tokenizer **tokenizer,
                                  const char *tokenizer_path)

Creates a GGMA tokenizer from a specified tokenizer path.

This function loads the necessary tokenizer components from the given tokenizer path and initializes a GGMA tokenizer handle.

Parameters
[out] tokenizer       Receives the handle of the tokenizer created from the given path.
[in]  tokenizer_path  Path to the directory containing the tokenizer model and vocabulary.
Returns
GGMA_STATUS_NO_ERROR on success, or an error code on failure: GGMA_STATUS_UNEXPECTED_NULL if tokenizer or tokenizer_path is NULL, or GGMA_STATUS_ERROR if the tokenizer cannot be created.

Definition at line 25 of file ggma_tokenize.cc.

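GGMA_STATUS ggma_create_tokenizer(ggma_tokenizer **tokenizer, const char *tokenizer_path)
{
  // Reject null arguments before touching them.
  if (!tokenizer || !tokenizer_path)
    return GGMA_STATUS_UNEXPECTED_NULL;

  try
  {
    std::string tokenizer_id = "sentencepiece";
    auto impl = ggma::TokenizerFactory::create(tokenizer_id, tokenizer_path);

    // Hand the implementation back to the caller as an opaque handle.
    *tokenizer = reinterpret_cast<ggma_tokenizer *>(impl);
    return GGMA_STATUS_NO_ERROR;
  }
  catch (...)
  {
    // No exception may cross the C API boundary.
    return GGMA_STATUS_ERROR;
  }
}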

References ggma::TokenizerFactory::create(), GGMA_STATUS_ERROR, GGMA_STATUS_NO_ERROR, and GGMA_STATUS_UNEXPECTED_NULL.
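
The listing above validates its pointers and then converts any exception into a status code. A stand-alone sketch of that boundary pattern is shown here. All names (`demo_status`, `demo_make`, `demo_open`) are hypothetical and the factory is a toy; only the control flow mirrors the listing.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical status codes standing in for GGMA_STATUS.
enum demo_status
{
  DEMO_NO_ERROR,
  DEMO_ERROR,
  DEMO_UNEXPECTED_NULL
};

// Hypothetical factory that throws when given an empty path.
int *demo_make(const char *path)
{
  if (path[0] == '\0')
    throw std::runtime_error("empty path");
  return new int(42);
}

// C-style entry point: checks the pointers first, then turns any
// exception into a status code so nothing propagates to a C caller.
demo_status demo_open(int **out, const char *path)
{
  if (!out || !path)
    return DEMO_UNEXPECTED_NULL;

  try
  {
    *out = demo_make(path);
    return DEMO_NO_ERROR;
  }
  catch (...)
  {
    return DEMO_ERROR; // *out is left untouched on failure
  }
}
```

In this sketch the output pointer is written only on success, so a caller that initializes it to `nullptr` can still rely on that value after a failed call.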

◆ ggma_detokenize()

GGMA_STATUS ggma_detokenize(const ggma_tokenizer *tokenizer,
                            const int32_t *tokens,
                            size_t n_tokens,
                            char *text,
                            size_t text_len)

Detokenizes a sequence of token IDs back into a text string.

This function uses the vocabulary from the created tokenizer to convert the sequence of token IDs back into a human-readable text string.

Parameters
[in]  tokenizer  The GGMA tokenizer handle used for detokenization.
[in]  tokens     Pointer to the input buffer containing the token IDs to detokenize.
[in]  n_tokens   The number of token IDs in the tokens buffer.
[out] text       Pointer to the output buffer that receives the detokenized text.
[in]  text_len   The size of the text buffer in bytes.
Returns
GGMA_STATUS_NO_ERROR on success, or an error code on failure: GGMA_STATUS_UNEXPECTED_NULL if tokenizer, tokens, or text is NULL, or another error code if detokenization fails (e.g., if the output buffer is too small).

Definition at line 79 of file ggma_tokenize.cc.

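GGMA_STATUS ggma_detokenize(const ggma_tokenizer *tokenizer, const int32_t *tokens,
                            size_t n_tokens, char *text, size_t text_len)
{
  // Reject null arguments before touching them.
  if (!tokenizer || !tokens || !text)
    return GGMA_STATUS_UNEXPECTED_NULL;

  try
  {
    auto impl = reinterpret_cast<const ggma::Tokenizer *>(tokenizer);
    impl->detokenize(tokens, n_tokens, text, text_len);
    return GGMA_STATUS_NO_ERROR;
  }
  catch (...)
  {
    return GGMA_STATUS_ERROR;
  }
}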

References GGMA_STATUS_ERROR, GGMA_STATUS_NO_ERROR, and GGMA_STATUS_UNEXPECTED_NULL.
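
A caller-side sketch of the detokenize call is given below. So that it compiles without the library, the block defines stand-ins in place of including ggma_tokenize.h. The signature is copied from this reference, but the enum values, the empty `ggma_tokenizer` struct, and the toy body (one byte per token ID, null-terminated output) are assumptions made for illustration and say nothing about the real implementation. `decode` is a hypothetical helper.

```cpp
#include <cassert>
#include <stddef.h>
#include <stdint.h>
#include <string>
#include <vector>

// --- Stand-ins so the sketch compiles without the library ---
// Status names come from the reference; numeric values are placeholders.
enum GGMA_STATUS { GGMA_STATUS_NO_ERROR, GGMA_STATUS_ERROR, GGMA_STATUS_UNEXPECTED_NULL };
struct ggma_tokenizer {}; // the real type is opaque

// Toy body: treats each token ID as one byte and null-terminates the output.
GGMA_STATUS ggma_detokenize(const ggma_tokenizer *tokenizer, const int32_t *tokens,
                            size_t n_tokens, char *text, size_t text_len)
{
  if (!tokenizer || !tokens || !text)
    return GGMA_STATUS_UNEXPECTED_NULL;
  if (n_tokens + 1 > text_len)
    return GGMA_STATUS_ERROR;
  for (size_t i = 0; i < n_tokens; ++i)
    text[i] = static_cast<char>(tokens[i]);
  text[n_tokens] = '\0';
  return GGMA_STATUS_NO_ERROR;
}
// --- End of stand-ins ---

// Hypothetical helper: detokenize into a buffer of `capacity` bytes and
// report failure through the return value.
bool decode(const ggma_tokenizer *tok, const std::vector<int32_t> &ids, size_t capacity,
            std::string &out)
{
  // One extra zero byte beyond `capacity` keeps the buffer terminated
  // even if the callee fills all `capacity` bytes.
  std::vector<char> buf(capacity + 1, '\0');
  if (ggma_detokenize(tok, ids.data(), ids.size(), buf.data(), capacity) !=
      GGMA_STATUS_NO_ERROR)
    return false;
  out.assign(buf.data());
  return true;
}
```

Checking the returned status before reading the buffer matters here, because the contents of `text` after a failed call are not specified in this reference.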

◆ ggma_free_tokenizer()

GGMA_STATUS ggma_free_tokenizer(ggma_tokenizer *tokenizer)

Frees all resources associated with a GGMA tokenizer.

Parameters
[in]  tokenizer  The GGMA tokenizer to free. The handle is invalid after the call.
Returns
GGMA_STATUS_NO_ERROR on success, or an error code on failure (e.g., GGMA_STATUS_UNEXPECTED_NULL if tokenizer is NULL).

Definition at line 44 of file ggma_tokenize.cc.

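GGMA_STATUS ggma_free_tokenizer(ggma_tokenizer *tokenizer)
{
  if (!tokenizer)
    return GGMA_STATUS_UNEXPECTED_NULL;

  try
  {
    // Recover the implementation object behind the opaque handle and destroy it.
    auto impl = reinterpret_cast<ggma::Tokenizer *>(tokenizer);
    delete impl;
    return GGMA_STATUS_NO_ERROR;
  }
  catch (...)
  {
    return GGMA_STATUS_ERROR;
  }
}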

References GGMA_STATUS_ERROR, GGMA_STATUS_NO_ERROR, and GGMA_STATUS_UNEXPECTED_NULL.
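
Since every successfully created tokenizer has to be released with ggma_free_tokenizer(), C++ callers may find it convenient to tie the handle to a `std::unique_ptr`. The sketch below assumes nothing beyond the two signatures in this reference. The stand-in definitions (placeholder enum values, an empty `ggma_tokenizer` struct, toy function bodies, and the `g_live` counter) exist only so the example compiles without the library. `TokenizerDeleter`, `TokenizerPtr`, and `open_tokenizer` are hypothetical names.

```cpp
#include <cassert>
#include <memory>

// --- Stand-ins so the sketch compiles without the library ---
// Status names come from the reference; numeric values are placeholders.
enum GGMA_STATUS { GGMA_STATUS_NO_ERROR, GGMA_STATUS_ERROR, GGMA_STATUS_UNEXPECTED_NULL };
struct ggma_tokenizer {}; // the real type is opaque
static int g_live = 0;    // counts live toy tokenizers, for demonstration only

GGMA_STATUS ggma_create_tokenizer(ggma_tokenizer **tokenizer, const char *tokenizer_path)
{
  if (!tokenizer || !tokenizer_path)
    return GGMA_STATUS_UNEXPECTED_NULL;
  *tokenizer = new ggma_tokenizer();
  ++g_live;
  return GGMA_STATUS_NO_ERROR;
}

GGMA_STATUS ggma_free_tokenizer(ggma_tokenizer *tokenizer)
{
  if (!tokenizer)
    return GGMA_STATUS_UNEXPECTED_NULL;
  delete tokenizer;
  --g_live;
  return GGMA_STATUS_NO_ERROR;
}
// --- End of stand-ins ---

// Deleter that forwards to the C free function.
struct TokenizerDeleter
{
  void operator()(ggma_tokenizer *t) const { ggma_free_tokenizer(t); }
};
using TokenizerPtr = std::unique_ptr<ggma_tokenizer, TokenizerDeleter>;

// Returns an owning pointer, or an empty one if creation failed.
TokenizerPtr open_tokenizer(const char *path)
{
  ggma_tokenizer *raw = nullptr;
  if (ggma_create_tokenizer(&raw, path) != GGMA_STATUS_NO_ERROR)
    return TokenizerPtr();
  return TokenizerPtr(raw);
}
```

A custom deleter works with an incomplete type, so the same wrapper should carry over unchanged when the stand-ins are replaced by the real header.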

◆ ggma_tokenize()

GGMA_STATUS ggma_tokenize(const ggma_tokenizer *tokenizer,
                          const char *text,
                          size_t text_len,
                          int32_t *tokens,
                          size_t n_tokens_max,
                          size_t *n_tokens)

Tokenizes an input text string into a sequence of token IDs.

This function uses the vocabulary from the created tokenizer to convert the input text into a series of numerical token IDs.

Parameters
[in]  tokenizer     The GGMA tokenizer handle used for tokenization.
[in]  text          The text string to tokenize.
[in]  text_len      The length of the text in bytes. If the text is null-terminated, this can be 0 and the length is determined internally.
[out] tokens        Output buffer that receives the generated token IDs.
[in]  n_tokens_max  The maximum number of tokens the tokens buffer can hold.
[out] n_tokens      Pointer to a variable that receives the number of tokens actually written to the tokens buffer.
Returns
GGMA_STATUS_NO_ERROR on success, or an error code on failure: GGMA_STATUS_UNEXPECTED_NULL if tokenizer, text, tokens, or n_tokens is NULL, or another error code if tokenization fails (e.g., if the output buffer is too small).

Definition at line 61 of file ggma_tokenize.cc.

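GGMA_STATUS ggma_tokenize(const ggma_tokenizer *tokenizer, const char *text, size_t text_len,
                          int32_t *tokens, size_t n_tokens_max, size_t *n_tokens)
{
  // Reject null arguments before touching them.
  if (!tokenizer || !text || !tokens || !n_tokens)
    return GGMA_STATUS_UNEXPECTED_NULL;

  try
  {
    auto impl = reinterpret_cast<const ggma::Tokenizer *>(tokenizer);
    impl->tokenize(text, text_len, tokens, n_tokens_max, n_tokens);
    return GGMA_STATUS_NO_ERROR;
  }
  catch (...)
  {
    return GGMA_STATUS_ERROR;
  }
}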

References GGMA_STATUS_ERROR, GGMA_STATUS_NO_ERROR, and GGMA_STATUS_UNEXPECTED_NULL.
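
The buffer handling implied by the parameter list (a caller-owned array of n_tokens_max entries, with the actual count returned through n_tokens) can be sketched from the caller's side as follows. As before, the stand-ins replace the real header so the block compiles on its own. The toy body emits one token per byte and is purely an assumption for illustration; a real tokenizer would produce different IDs. `encode` is a hypothetical helper.

```cpp
#include <cassert>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <vector>

// --- Stand-ins so the sketch compiles without the library ---
// Status names come from the reference; numeric values are placeholders.
enum GGMA_STATUS { GGMA_STATUS_NO_ERROR, GGMA_STATUS_ERROR, GGMA_STATUS_UNEXPECTED_NULL };
struct ggma_tokenizer {}; // the real type is opaque

// Toy body: one token per byte; text_len == 0 means "measure with strlen",
// following the parameter description above.
GGMA_STATUS ggma_tokenize(const ggma_tokenizer *tokenizer, const char *text, size_t text_len,
                          int32_t *tokens, size_t n_tokens_max, size_t *n_tokens)
{
  if (!tokenizer || !text || !tokens || !n_tokens)
    return GGMA_STATUS_UNEXPECTED_NULL;
  if (text_len == 0)
    text_len = strlen(text);
  if (text_len > n_tokens_max)
    return GGMA_STATUS_ERROR;
  for (size_t i = 0; i < text_len; ++i)
    tokens[i] = static_cast<unsigned char>(text[i]);
  *n_tokens = text_len;
  return GGMA_STATUS_NO_ERROR;
}
// --- End of stand-ins ---

// Hypothetical helper: tokenize a null-terminated string into `ids`,
// allowing at most `max_tokens` tokens.
bool encode(const ggma_tokenizer *tok, const char *text, size_t max_tokens,
            std::vector<int32_t> &ids)
{
  ids.assign(max_tokens, 0); // caller-owned buffer of n_tokens_max entries
  size_t n = 0;
  if (ggma_tokenize(tok, text, 0, ids.data(), ids.size(), &n) != GGMA_STATUS_NO_ERROR)
  {
    ids.clear();
    return false;
  }
  ids.resize(n); // keep only the tokens that were actually written
  return true;
}
```

Shrinking the vector to n_tokens after the call avoids treating the unused tail of the buffer as token IDs.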