Awesome: h2oGPT works directly with local PDF, Excel, Word, image, and other files

Introduction to the h2oGPT project


Query and summarize your documents, or just chat with your local private GPT LLM using h2oGPT, an Apache V2 open source project.

Project address

https://github.com/h2oai/h2ogpt

Demo address

https://gpt.h2o.ai/

Main features

  • Private offline database of any document (PDF, Excel, Word, images, code, text, Markdown, etc.)
  • Persistent database (Chroma, Weaviate, or in-memory FAISS) using accurate embeddings (instructor-large, all-MiniLM-L6-v2, etc.)
  • Efficient use of context using instruction-tuned LLMs (no need for LangChain's few-shot approach)
  • Parallel summarization reaching up to 80 tokens/sec output with 13B LLaMa2
  • Upload and view documents through the UI (control multiple collaborative or personal collections)
  • UI or CLI with streaming for all models
  • Query multiple models simultaneously in UI mode
  • Multiple models supported (LLaMa2, Falcon, Vicuna, WizardLM, including AutoGPTQ, 4-bit/8-bit, LoRA)
  • GPU support for HF and LLaMa.cpp GGML models, and CPU support for models using HF, LLaMa.cpp, and GPT4ALL
  • Linux, Docker, macOS, and Windows support
  • Inference server support (HF TGI server, vLLM, Gradio, ExLLaMa, OpenAI)
  • OpenAI-compliant Python client API for client-server control (a sketch follows this list)
  • Evaluate performance using reward models
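
Below is a minimal sketch of what talking to a local h2oGPT server through an OpenAI-compatible client could look like. It is not the project's documented example: the base URL, port, API key handling, and model name are all assumptions that depend on how your server is started, so check the h2oGPT docs for the exact values.

# openai_client_sketch.py - query a local h2oGPT server via an OpenAI-compatible endpoint
# NOTE: base_url, api_key and model are assumptions; adjust to your server's settings.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="h2ogpt",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize my uploaded documents in two sentences."}],
)
print(response.choices[0].message.content)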

Model and dataset downloads

https://huggingface.co/h2oai
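
If you want to pull one of these models to disk (for example to point --base_model at a local copy), a sketch using the huggingface_hub library is shown below; the repository name is only an example, so browse the h2oai organization page for the model you actually want.

# download_model_sketch.py - fetch an h2oai model snapshot from Hugging Face
# NOTE: the repo_id below is an example; pick the model you need from huggingface.co/h2oai
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="h2oai/h2ogpt-4096-llama2-13b-chat")
print("Model downloaded to:", local_dir)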

Hands-on experience

Upload files: note that you can upload many common types of local files.

Supported native data types

  • .pdf: Portable Document Format (PDF)
  • .txt: text file (UTF-8)
  • .csv: CSV
  • .toml: TOML
  • .py: Python
  • .rst: reStructuredText
  • .rtf: Rich Text Format
  • .md: Markdown
  • .html: HTML file
  • .docx: Word document (optional)
  • .doc: Word document (optional)
  • .xlsx: Excel document (optional)
  • .xls: Excel document (optional)
  • .enex: Evernote
  • .eml: email
  • .epub: e-book
  • .odt: OpenDocument Text
  • .pptx: PowerPoint document
  • .ppt: PowerPoint document
  • .png: PNG image (optional)
  • .jpg: JPEG image (optional)
  • .jpeg: JPEG image (optional)
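
As a small illustration (not part of h2oGPT itself), a script like the one below could gather only files of the supported types into the user_path directory that the run command in step 6 later points at; the source directory name my_documents is hypothetical.

# collect_supported.py - copy only supported file types into user_path for ingestion
import shutil
from pathlib import Path

# extensions taken from the list above
SUPPORTED = {
    ".pdf", ".txt", ".csv", ".toml", ".py", ".rst", ".rtf", ".md", ".html",
    ".docx", ".doc", ".xlsx", ".xls", ".enex", ".eml", ".epub", ".odt",
    ".pptx", ".ppt", ".png", ".jpg", ".jpeg",
}

def collect_supported(src_dir: str, user_path: str = "user_path") -> int:
    """Copy supported files from src_dir into user_path; return how many were copied."""
    dest = Path(user_path)
    dest.mkdir(exist_ok=True)
    copied = 0
    for path in Path(src_dir).rglob("*"):
        if path.is_file() and path.suffix.lower() in SUPPORTED:
            shutil.copy2(path, dest / path.name)
            copied += 1
    return copied

if __name__ == "__main__":
    print(collect_supported("my_documents"), "files copied")  # my_documents is a placeholder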

Generating answers: after you ask a question, multiple models answer at the same time, and you can pick whichever answer seems most reasonable.

Document management

You can view and manage your uploaded documents.

Chat history management

Custom output configuration

Deployment

1: Download Visual Studio 2022
2: Download the MinGW installer

3: Download and install Miniconda

4: Install dependencies
# Required for Doc Q/A: LangChain:
pip install -r reqs_optional/requirements_optional_langchain.txt
# Required for CPU: LLaMa/GPT4All:
pip install -r reqs_optional/requirements_optional_gpt4all.txt
# Optional: PyMuPDF/ArXiv:
pip install -r reqs_optional/requirements_optional_langchain.gpllike.txt
# Optional: Selenium/PlayWright:
pip install -r reqs_optional/requirements_optional_langchain.urls.txt
# Optional: for supporting unstructured package
python -m nltk.downloader all
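
To quickly confirm the optional dependencies installed correctly, a small check like the one below can be run; the module names are assumptions based on the requirement files above (fitz is PyMuPDF's import name), so adjust them to what you actually installed.

# check_deps.py - rough sanity check that the optional packages import
import importlib

for name in ["langchain", "gpt4all", "fitz", "nltk"]:
    try:
        importlib.import_module(name)
        print(f"{name}: OK")
    except ImportError as exc:
        print(f"{name}: missing ({exc})")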

5: Optional Configuration
6: Run

For document Q/A with the UI using a LLaMa.cpp-based model on CPU or GPU:

Download the WizardLM GGML model file (for example WizardLM-7B-uncensored.ggmlv3.q8_0.bin, as seen in the log below) and place it in the h2oGPT directory.

python generate.py --base_model='llama' --prompt_type=wizard2 --score_model=None --langchain_mode='UserData' --user_path=user_path
7: Use and share

Starting get_model: llama
ggml_init_cublas: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti
Device 1: NVIDIA GeForce RTX 2080
llama.cpp: loading model from WizardLM-7B-uncensored.ggmlv3.q8_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32001
llama_model_load_internal: n_ctx = 1792
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 7 (mostly Q8_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090 Ti) as main device
llama_model_load_internal: mem required = 4518.85 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 368 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 20 repeating layers to GPU
llama_model_load_internal: offloaded 20/35 layers to GPU
llama_model_load_internal: total VRAM used: 4470 MB
llama_new_context_with_model: kv self size = 896.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
Model {'base_model': 'llama', 'tokenizer_base_model': '', 'lora_weights': '', 'inference_server': '', 'prompt_type': 'wizard2', 'prompt_dict': {'promptA': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.', 'promptB': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.', 'PreInstruct': '\n### Instruction:\n', 'PreInput': None, 'PreResponse': '\n### Response:\n', 'terminate_response': ['\n### Response:\n'], 'chat_sep': '\n', 'chat_turn_sep': '\n', 'humanstr': '\n### Instruction:\n', 'botstr': '\n### Response:\n', 'generates_leading_space': False}}
Running on local URL: http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.

Go to http://127.0.0.1:7860 (ignore the message above). Add --share=True for shareable secure links.

To chat with the LLM only, click Resources in Collections and select LLM, or start without --langchain_mode=UserData.

In nvidia-smi or another GPU monitoring program, you should see python.exe running on the GPU in C (compute) mode and consuming GPU resources.

On an i9 with a 3090 Ti, you get about 5 tokens per second.
If you have multiple GPUs, it's a good idea to pin h2oGPT to the fastest GPU (for example device 0, if it is the fastest card with the most memory), as in the sketch below.
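
A minimal sketch of one way to do this: restrict the visible devices with the standard CUDA_VISIBLE_DEVICES environment variable before launching generate.py. The launch arguments are just the ones from step 6; whether device 0 really is your fastest GPU is an assumption to verify with nvidia-smi.

# run_on_gpu0.py - launch generate.py pinned to a single chosen GPU
import os
import subprocess

env = os.environ.copy()
env["CUDA_VISIBLE_DEVICES"] = "0"  # expose only device 0 (assumed fastest) to the process

subprocess.run(
    ["python", "generate.py",
     "--base_model=llama", "--prompt_type=wizard2", "--score_model=None",
     "--langchain_mode=UserData", "--user_path=user_path"],
    env=env,
    check=True,
)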

If you're interested, go and try it out!

Original article: blog.csdn.net/specssss/article/details/132068519