Introduction to the h2oGPT project
Query and summarize your documents, or just chat with your local private GPT LLM using h2oGPT, an Apache V2 open source project.
Project address
https://github.com/h2oai/h2ogpt
Live demo address
https://gpt.h2o.ai/
Main features
- Private offline database of any documents (PDF, Excel, Word, images, code, text, Markdown, etc.)
- Persistent database (Chroma, Weaviate, or in-memory FAISS) using accurate embeddings (instructor-large, all-MiniLM-L6-v2, etc.)
- Efficient use of context using instruction-tuned LLMs (no need for LangChain's few-shot approach)
- Parallel summarization and extraction, reaching up to 80 tokens/sec output with the 13B LLaMa2 model
- Upload and view documents through the UI (control multiple collaborative or personal collections)
- UI or CLI with streaming for all models
- UI mode can query multiple models at the same time
- Variety of models supported (LLaMa2, Falcon, Vicuna, WizardLM, including AutoGPTQ, 4-bit/8-bit, LoRA)
- GPU support for HF and LLaMa.cpp GGML models, and CPU support for models using HF, LLaMa.cpp, and GPT4All
- Linux, Docker, macOS, and Windows support
- Inference server support (HF TGI server, vLLM, Gradio, ExLlama, OpenAI)
- OpenAI-compliant Python client API for client-server control (see the sketch after this list)
- Evaluate performance using reward models
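As a taste of the client side, h2oGPT serves its UI through Gradio, so the generic gradio_client package can drive a running server from Python. This is a minimal sketch, not the project's official client: the endpoint name api_name="/submit_nochat" is an assumption, so list the real endpoints with client.view_api() on your own server.
# Minimal sketch: query a running h2oGPT server through the generic Gradio client.
# The api_name below is an assumption; verify with client.view_api().
from gradio_client import Client

client = Client("http://localhost:7860")  # server started by generate.py (see Deployment)
answer = client.predict("Why is the sky blue?", api_name="/submit_nochat")
print(answer)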
Model and dataset downloads
https://huggingface.co/h2oai
Trying it out
Upload files: note that you can upload many common types of local files.
Supported native data types
.pdf: Portable Document Format (PDF),
.txt: Text File (UTF-8),
.csv: CSV,
.toml: Toml,
.py: Python,
.rst: reStructuredText,
.rtf: Rich Text Format,
.md: Markdown,
.html: HTML file,
.docx: Word document (optional),
.doc: Word document (optional),
.xlsx: Excel document (optional),
.xls: Excel document (optional),
.enex: Evernote,
.eml: email,
.epub: e-book,
.odt: OpenDocument Text,
.pptx: PowerPoint document,
.ppt: PowerPoint document,
.png: PNG image (optional),
.jpg: JPEG image (optional),
.jpeg: JPEG image (optional).
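If you want to pre-filter a folder before uploading, the list above maps directly to a suffix check. A minimal sketch (the user_path folder name reuses the ingestion directory from the deployment section below):
# Sketch: keep only files whose extensions appear in the supported-types list above.
from pathlib import Path

SUPPORTED = {".pdf", ".txt", ".csv", ".toml", ".py", ".rst", ".rtf", ".md", ".html",
             ".docx", ".doc", ".xlsx", ".xls", ".enex", ".eml", ".epub", ".odt",
             ".pptx", ".ppt", ".png", ".jpg", ".jpeg"}

files = [p for p in Path("user_path").rglob("*") if p.suffix.lower() in SUPPORTED]
print(f"{len(files)} ingestible files found")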
Generating answers: after you ask a question, multiple models answer at the same time, and you can pick whichever response you find most reasonable.
Document Management
You can view and manage your uploaded documents.
Chat history management
Custom output configuration
Deployment
1: Download Visual Studio 2022
2: Download the MinGW installer
3: Download and install Miniconda
4: Install dependencies
# Required for Doc Q/A: LangChain:
pip install -r reqs_optional/requirements_optional_langchain.txt
# Required for CPU: LLaMa/GPT4All:
pip install -r reqs_optional/requirements_optional_gpt4all.txt
# Optional: PyMuPDF/ArXiv:
pip install -r reqs_optional/requirements_optional_langchain.gpllike.txt
# Optional: Selenium/PlayWright:
pip install -r reqs_optional/requirements_optional_langchain.urls.txt
# Optional: for supporting unstructured package
python -m nltk.downloader all
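To confirm the optional stacks installed above are importable, a quick sanity check (a sketch; these are the usual import names for the packages involved, not an h2oGPT-provided script):
# Sanity check: import names for the packages pulled in by the steps above.
import langchain   # Doc Q/A (requirements_optional_langchain.txt)
import gpt4all     # CPU LLaMa/GPT4All path
import fitz        # PyMuPDF, from the optional gpllike requirements
import nltk        # used by the unstructured package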
5: Optional Configuration
6: Run
For document Q/A with UI using LLaMa.cpp-based model on CPU or GPU:
Download the Wizard model and place the file in the h2oGPT directory.
python generate.py --base_model='llama' --prompt_type=wizard2 --score_model=None --langchain_mode='UserData' --user_path=user_path
7: Use and share
Starting get_model: llama
ggml_init_cublas: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti
Device 1: NVIDIA GeForce RTX 2080
llama.cpp: loading model from WizardLM-7B-uncensored.ggmlv3.q8_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32001
llama_model_load_internal: n_ctx = 1792
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 7 (mostly Q8_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090 Ti) as main device
llama_model_load_internal: mem required = 4518.85 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 368 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 20 repeating layers to GPU
llama_model_load_internal: offloaded 20/35 layers to GPU
llama_model_load_internal: total VRAM used: 4470 MB
llama_new_context_with_model: kv self size = 896.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
Model {'base_model': 'llama', 'tokenizer_base_model': '', 'lora_weights': '', 'inference_server': '', 'prompt_type': 'wizard2', 'prompt_dict': {'promptA': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.', 'promptB': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.', 'PreInstruct': '\n### Instruction:\n', 'PreInput': None, 'PreResponse': '\n### Response:\n', 'terminate_response': ['\n### Response:\n'], 'chat_sep': '\n', 'chat_turn_sep': '\n', 'humanstr': '\n### Instruction:\n', 'botstr': '\n### Response:\n', 'generates_leading_space': False}}
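The prompt_dict printed above fully determines how a turn is serialized for the wizard2 prompt type. A minimal sketch of that formatting (format_turn is an illustrative helper built from the logged fields, not an h2oGPT function):
# Rebuild a single wizard2-style prompt from the fields logged above.
PROMPT_A = ("Below is an instruction that describes a task. "
            "Write a response that appropriately completes the request.")
PRE_INSTRUCT = "\n### Instruction:\n"
PRE_RESPONSE = "\n### Response:\n"

def format_turn(instruction: str) -> str:
    # system preamble + instruction marker + user text + response marker
    return PROMPT_A + PRE_INSTRUCT + instruction + PRE_RESPONSE

print(format_turn("Why is the sky blue?"))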
Running on local URL: http://0.0.0.0:7860
To create a public link, set `share=True` in `launch()`.
Go to http://127.0.0.1:7860 (ignore the message above). Add --share=True for shareable secure links.
To chat with the LLM only, click Resources in Collections and select LLM, or start without --langchain_mode=UserData.
In nvidia-smi or another GPU monitor, you should see python.exe running in C (compute) mode and using GPU resources.
On an i9 with a 3090 Ti, you get about 5 tokens per second.
If you have multiple GPUs, it is a good idea to pin the job to the fastest GPU (for example, device 0 if it is the fastest card with the most memory), as shown below.
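The standard way to do this (a general CUDA mechanism, not h2oGPT-specific) is the CUDA_VISIBLE_DEVICES environment variable; the shell equivalent is CUDA_VISIBLE_DEVICES=0 python generate.py ...
# Expose only the fast GPU (device 0 = RTX 3090 Ti in the log above) to the process.
# Must run before torch / llama.cpp initialize CUDA.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"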
If you are interested, go give it a try!