On-device large language models using Keras and TensorFlow Lite

Deploying large language models to Android devices is a complex task as these models often require extensive computing resources. First, you need to load and optimize the model, then choose a suitable inference engine, and finally integrate it into your Android application. Here are the general steps:

1. Load and optimize the model:

  • If you built your large language model with Keras, first load it in Python.
  • Convert the model to TensorFlow Lite format with the TensorFlow Lite Converter, which reduces model size and improves performance on mobile devices (a minimal sketch follows this list).
  • The model can also be compressed with quantization to reduce its memory and compute requirements.

2. Select an inference engine:

  • Running models on Android devices requires choosing an appropriate inference engine. TensorFlow Lite and ONNX Runtime are common choices for running various types of models on Android.
  • Evaluate the performance of different engines and choose the one that best suits your model and application needs.

3. Integrate into Android applications:

  • Create an Android application project and integrate the model files and inference engine into the project.
  • Use Android Studio to develop app interfaces so users can interact with the model.
  • Set up an inference pipeline in your app to process user input and get the model's output.

4. Optimize performance:

  • Running large language models on mobile devices can strain performance, especially memory and compute.
  • Optimize inference, for example by batching inputs, reducing the memory footprint, or using GPU acceleration.
  • Cache the loaded model to avoid reloading it on every request.

5. Testing and Debugging:

  • Test on real devices to verify that the model performs well on actual mobile hardware, not just in an emulator.
  • Perform performance analysis and debugging to resolve any performance or functionality issues.

6. Security and Privacy:

  • Implement necessary security and privacy measures in your applications, especially when handling sensitive data.
  • Ensure safe storage and transmission of models and data.

7. Publish the application:

  • Prepare the application for publishing to the Google Play Store or other appropriate app stores.
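
As a rough sketch of steps 1 and 4, a generic Keras-to-TFLite conversion with post-training dynamic range quantization looks like the following. The model and file names are placeholders; the GPT-2-specific conversion used later in this article follows a different path via concrete functions.

import tensorflow as tf

# Placeholder: load your own trained Keras model here.
model = tf.keras.models.load_model("my_language_model.keras")

converter = tf.lite.TFLiteConverter.from_keras_model(model)
# Post-training dynamic range quantization stores weights as 8-bit integers.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("my_language_model.tflite", "wb") as f:
    f.write(tflite_model)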

Please note that this is an advanced task that requires in-depth knowledge of both mobile application development and deep learning. You may need to learn more about Android development, TensorFlow Lite or ONNX Runtime, and model optimization techniques to deploy large language models to Android devices successfully. Also make sure you comply with relevant regulations and privacy policies to protect user data.

Related examples

One of the most exciting recent machine learning breakthroughs is large language models (LLMs). These models can be used to generate text, translate languages, and answer questions in a comprehensive and informative way. LLMs such as Google's LaMDA and PaLM are trained on massive amounts of text data, which lets them learn the statistical patterns and relationships between words and phrases. As a result, they can generate text that reads like human writing and translate between languages with a high degree of accuracy.

LLMs take up a lot of storage space and usually demand a lot of compute to run, so they are typically deployed in the cloud. Running them on-device (on-device machine learning, ODML) is extremely challenging because of the limited computing power of mobile devices. Still, you can run smaller-scale LLMs, such as GPT-2, on newer Android devices and achieve impressive results.

In this codelab, you'll learn how to build an LLM-powered application by:

  • Loading a pretrained LLM (GPT-2) using KerasNLP
  • Fine-tuning the LLM (GPT-2) using KerasNLP
  • Converting, optimizing, and deploying the LLM on Android using TensorFlow Lite

Prerequisites

  • Intermediate knowledge of Keras and TensorFlow Lite
  • Basic knowledge of Android development

What you'll learn

  • How to load and fine-tune a pre-trained LLM using KerasNLP
  • How to quantize the LLM and convert it to TensorFlow Lite
  • How to use the converted TensorFlow Lite model for inference

About KerasNLP

KerasNLP is a natural language processing library that supports users through their entire development cycle. Its workflows are built from modular components that come with state-of-the-art preset weights and architectures out of the box and are easily customizable when more control is needed. It emphasizes in-graph computation for all workflows, so developers can move to production easily using the TensorFlow ecosystem. The library is an extension of the core Keras API: all high-level modules are Layers or Models that receive the same level of polish as core Keras.

  • Installation

pip install git+https://github.com/keras-team/keras-nlp.git --upgrade

  • Quick start

Use the keras_nlp.models API to fine-tune BERT on a small sentiment analysis task:

import keras_nlp
import tensorflow_datasets as tfds

imdb_train, imdb_test = tfds.load(
    "imdb_reviews",
    split=["train", "test"],
    as_supervised=True,
    batch_size=16,
)
# Load a BERT model.
classifier = keras_nlp.models.BertClassifier.from_preset("bert_base_en_uncased")
# Fine-tune on IMDb movie reviews.
classifier.fit(imdb_train, validation_data=imdb_test)
# Predict two new examples.
classifier.predict(["What an amazing movie!", "A total waste of my time."])

Deploy the LLM on-device

Android sample application

  • GitHub: examples/lite/examples/generative_ai

Initialize the model

The initModel method initializes the TensorFlow Lite model: it first attempts to load the model file and then instantiates an Interpreter object to run the model.

override suspend fun initModel(): InitModelResult {
    return withContext(dispatcher) {
        // Load model file
        val loadResult = loadModelFile(context)

        // Determine if load was successful
        if (loadResult.isFailure) {
            val exc = loadResult.exceptionOrNull()
            return@withContext if (exc is FileNotFoundException) {
                InitModelResult.Error(AutoCompleteServiceError.MODEL_FILE_NOT_FOUND)
            } else {
                InitModelResult.Error(AutoCompleteServiceError.MODEL_NOT_INITIALIZED)
            }
        }

        // Instantiate interpreter with loaded model
        val model = loadResult.getOrNull()
        isInitialized = model?.let {
            interpreter = Interpreter(it)
            true
        } ?: false

        if (isInitialized) InitModelResult.Success
        else InitModelResult.Error(AutoCompleteServiceError.MODEL_NOT_INITIALIZED)
    }
}

Run the model

The runInterpreterOn method runs the TensorFlow Lite model and generates new text from the input text.

@WorkerThread
private fun runInterpreterOn(input: String): String {
    outputBuffer.clear()

    // Run interpreter, which will generate text into outputBuffer
    interpreter.run(input, outputBuffer)

    // Set output buffer limit to current position & position to 0
    outputBuffer.flip()

    // Get bytes from output buffer
    val bytes = ByteArray(outputBuffer.remaining())
    outputBuffer.get(bytes)
    outputBuffer.clear()

    // Return bytes converted to String
    return String(bytes, Charsets.UTF_8)
}

Step-by-step guide

Step 1. Train the language model using Keras
In this demonstration, we will use KerasNLP to obtain the GPT-2 model. KerasNLP is a library of state-of-the-art pre-trained models for natural language processing tasks that supports users through their entire development cycle; you can view the list of available models in the KerasNLP repository. Its workflows are built from modular components that come with state-of-the-art preset weights and architectures out of the box and are easily customizable when more control is needed. Creating the GPT-2 model takes just these steps:

gpt2_tokenizer = keras_nlp.models.GPT2Tokenizer.from_preset("gpt2_base_en")

gpt2_preprocessor = keras_nlp.models.GPT2CausalLMPreprocessor.from_preset(
  "gpt2_base_en",
  sequence_length=256,
  add_end_token=True,
)

gpt2_lm = keras_nlp.models.GPT2CausalLM.from_preset(
  "gpt2_base_en", 
  preprocessor=gpt2_preprocessor,
)
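
Once instantiated, the model can generate text directly in Python, which is a handy sanity check before any conversion. The prompt and max_length below are arbitrary examples:

output = gpt2_lm.generate("My trip to Yosemite was", max_length=200)
print(output)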

You can view the complete GPT-2 model implementation on GitHub.
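
The codelab also fine-tunes GPT-2 before exporting it; that code is not reproduced in this article, but a minimal sketch looks roughly like the following. The tiny inline dataset and the hyperparameters are illustrative only, not the codelab's:

import tensorflow as tf

# Illustrative raw-text dataset; in practice you would load a real corpus.
finetune_ds = tf.data.Dataset.from_tensor_slices([
    "A first training example sentence.",
    "A second training example sentence.",
]).batch(2)

gpt2_lm.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    weighted_metrics=["accuracy"],
)
# The preprocessor attached to gpt2_lm tokenizes the raw strings on the fly.
gpt2_lm.fit(finetune_ds, epochs=1)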

Step 2. Convert Keras model to TFLite model
The conversion starts from the generate() function of GPT2CausalLM. Wrap the generate() function in a tf.function to create a concrete TensorFlow function:

@tf.function
def generate(prompt, max_length):
  # prompt: input prompt to the LLM in string format
  # max_length: the max length of the generated tokens 
  return gpt2_lm.generate(prompt, max_length)
concrete_func = generate.get_concrete_function(tf.TensorSpec([], tf.string), 100)

Now define a helper function that runs inference with an input and a TFLite model. TensorFlow Text ops are not built-in ops in the TFLite runtime, so you need to register these custom ops for the interpreter to run inference on this model. The helper function accepts an input string and the converted TFLite model (generate_tflite) produced from the generate() function defined above.

import numpy as np
import tensorflow_text as tf_text
from tensorflow.lite.python import interpreter

def run_inference(input, generate_tflite):
  # Register the TF Text custom ops so the interpreter can execute them.
  interp = interpreter.InterpreterWithCustomOps(
    model_content=generate_tflite,
    custom_op_registerers=tf_text.tflite_registrar.SELECT_TFTEXT_OPS)
  interp.get_signature_list()

  generator = interp.get_signature_runner('serving_default')
  output = generator(prompt=np.array([input]))

You can now convert the model:

gpt2_lm.jit_compile = False
converter = tf.lite.TFLiteConverter.from_concrete_functions(
  [concrete_func],
  gpt2_lm)

converter.target_spec.supported_ops = [
  tf.lite.OpsSet.TFLITE_BUILTINS, # enable TFLite ops
  tf.lite.OpsSet.SELECT_TF_OPS, # enable TF ops
]
converter.allow_custom_ops = True
converter.target_spec.experimental_select_user_tf_ops = [
  "UnsortedSegmentJoin",
  "UpperBound"
]
converter._experimental_guarantee_all_funcs_one_use = True
generate_tflite = converter.convert()
run_inference("I'm enjoying a", generate_tflite)

Step 3. Quantization
TensorFlow Lite implements an optimization technique called quantization that can reduce model size and speed up inference. During quantization, 32-bit floating-point numbers are mapped to smaller 8-bit integers, shrinking the model to roughly a quarter of its size for more efficient execution on modern hardware. There are several ways to quantize a model in TensorFlow; see the TFLite Model Optimization and TensorFlow Model Optimization Toolkit pages for more information. Common approaches include dynamic range, full integer, and float16 quantization.

Here you will apply post-training dynamic range quantization to the GPT-2 model by setting the converter optimization flag to tf.lite.Optimize.DEFAULT; the rest of the conversion process is the same as detailed above. In our testing, with this quantization technique and a maximum output length of 100, latency on a Pixel 7 was approximately 6.7 seconds.

gpt2_lm.jit_compile = False
converter = tf.lite.TFLiteConverter.from_concrete_functions(
  [concrete_func],
  gpt2_lm)

converter.target_spec.supported_ops = [
  tf.lite.OpsSet.TFLITE_BUILTINS, # enable TFLite ops
  tf.lite.OpsSet.SELECT_TF_OPS, # enable TF ops
]
converter.allow_custom_ops = True
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.experimental_select_user_tf_ops = [
  "UnsortedSegmentJoin",
  "UpperBound"
]
converter._experimental_guarantee_all_funcs_one_use = True
quant_generate_tflite = converter.convert()
run_inference("I'm enjoying a", quant_generate_tflite)

with open('quantized_gpt2.tflite', 'wb') as f:
  f.write(quant_generate_tflite)
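
As a quick sanity check (not part of the codelab), you can compare the sizes of the two serialized flatbuffers; dynamic range quantization should shrink the model to roughly a quarter of its float32 size:

# Both variables hold serialized TFLite flatbuffers from the steps above.
print(f"float32 model:   {len(generate_tflite) / 2**20:.1f} MiB")
print(f"quantized model: {len(quant_generate_tflite) / 2**20:.1f} MiB")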

Step 4. Android App Integration
You can clone the repository and replace android/app/src/main/assets/autocomplete.tflite with the quantized quant_generate_tflite file you just saved.

Related references

On-device large language models using Keras and TensorFlow Lite
https://codelabs.developers.google.com/kerasnlp-tflite
