The default <a href="https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english" target="_blank">model</a> used for text classification is a fine-tuned version of DistilBERT-base-uncased that has been specifically optimized for the Stanford Sentiment Treebank dataset (SST-2).
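A minimal query against this default model can look like the following (a sketch; the input strings and output alias are illustrative):

```sql
-- Classify two strings with the default text-classification model
SELECT pgml.transform(
    task   => 'text-classification',
    inputs => ARRAY[
        'I love how amazingly simple ML has become!',
        'I hate doing mundane and thankless tasks.'
    ]
) AS positivity;
```

Each input string is returned with a predicted label and a confidence score.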
*Using specific model*
To use one of the over 19,000 models available on Hugging Face, include the name of the desired model and the `text-classification` task as a JSONB object in the SQL query. For example, if you want to use a RoBERTa <a href="https://huggingface.co/models?pipeline_tag=text-classification" target="_blank">model</a> trained on around 40,000 English tweets that has POS (positive), NEG (negative), and NEU (neutral) labels for its classes, include this information in the JSONB object when making your query.
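A sketch of such a query (the model id shown is illustrative; substitute the Hugging Face model you want to use):

```sql
-- Pass task and model together as a JSONB object
SELECT pgml.transform(
    task => '{
        "task": "text-classification",
        "model": "finiteautomata/bertweet-base-sentiment-analysis"
    }'::JSONB,
    inputs => ARRAY[
        'I love how amazingly simple ML has become!',
        'I hate doing mundane and thankless tasks.'
    ]
) AS positivity;
```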
Sampling methods involve selecting the next word or sequence of words at random from the set of possible candidates, weighted by their probabilities according to the language model. This can result in more diverse and creative text, as well as avoiding repetitive patterns. In its most basic form, sampling means randomly picking the next word $w_t$ according to its conditional probability distribution:
$$ w_t \sim P(w_t \mid w_{1:t-1}) $$
However, the randomness of the sampling method can also result in less coherent or inconsistent text, depending on the quality of the model and the chosen sampling parameters such as temperature, top-k, or top-p. Therefore, choosing an appropriate sampling method and parameters is crucial for achieving the desired balance between creativity and coherence in generated text.
You can pass `do_sample = True` in the arguments to use sampling methods. It is recommended to alter `temperature` or `top_p` but not both.
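For example (a sketch; the prompt is illustrative, and the `args` keys follow standard Hugging Face generation parameters):

```sql
-- Generate text with sampling enabled
SELECT pgml.transform(
    task => '{"task": "text-generation"}'::JSONB,
    inputs => ARRAY['Three Rings for the Elven-kings under the sky'],
    args => '{
        "do_sample": true,
        "temperature": 0.9,
        "max_new_tokens": 50
    }'::JSONB
) AS output;
```

Lower `temperature` values make the distribution sharper (more deterministic); higher values increase diversity.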
`SELECT * from tweet_embeddings limit 2;`
|"QT @user In the original draft of the 7th book, Remus Lupin survived the Battle of Hogwarts. #HappyBirthdayRemusLupin"|{-0.1567948312,-0.3149209619,0.2163394839,..}|
|"Ben Smith / Smith (concussion) remains out of the lineup Thursday, Curtis #NHL #SJ"|{-0.0701668188,-0.012231146,0.1304316372,.. }|
## Step 2: Indexing your embeddings using different algorithms
After you've created embeddings for your data, you need to index them using one or more indexing algorithms. There are several different types of indexing algorithms available, including B-trees, k-nearest neighbors (KNN), and approximate nearest neighbors (ANN). The specific type of indexing algorithm you choose will depend on your use case and performance requirements. For example, B-trees are a good choice for range queries, while KNN and ANN algorithms are more efficient for similarity searches.
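For similarity search over the `tweet_embeddings` table above, an ANN index can be created with pgvector (a sketch; it assumes the embeddings live in a pgvector `vector` column named `embedding`, and `lists` is a tunable speed/recall trade-off):

```sql
-- IVFFlat ANN index using cosine distance
CREATE INDEX ON tweet_embeddings
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
```

Choose the distance operator class (`vector_cosine_ops`, `vector_l2_ops`, or `vector_ip_ops`) to match the operator you query with.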
`SELECT * FROM items, query ORDER BY items.embedding <-> query.embedding LIMIT 5;`
|5 RT's if you want the next episode of twilight princess tomorrow|
|Jurassic Park is BACK! New Trailer for the 4th Movie, Jurassic World -|
<!-- ## Sentence Similarity
Sentence Similarity involves determining the degree of similarity between two texts. To accomplish this, Sentence similarity models convert the input texts into vectors (embeddings) that encapsulate semantic information, and then measure the proximity (or similarity) between the vectors. This task is especially beneficial for tasks such as information retrieval and clustering/grouping.
<!-- # Regression
# Classification -->
# LLM Fine-tuning
In this section, we provide a step-by-step walkthrough of fine-tuning a large language model (LLM) for different tasks.
* hub_token: Your Hugging Face API token to push the fine-tuned model to the Hugging Face Model Hub. Replace "YOUR_HUB_TOKEN" with the actual token.
* push_to_hub: A boolean flag indicating whether to push the model to the Hugging Face Model Hub after fine-tuning.
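Putting these parameters together, a `pgml.tune` call might look like the following (a sketch; the project name, relation, and hyperparameter values are illustrative placeholders, and `"YOUR_HUB_TOKEN"` must be replaced with your actual token):

```sql
-- Fine-tune a classifier and push the result to the Hugging Face Hub
SELECT pgml.tune(
    'my_sentiment_project',
    task => 'text-classification',
    relation_name => 'pgml.my_training_data',
    model_name => 'distilbert-base-uncased',
    test_size => 0.2,
    test_sampling => 'last',
    hyperparams => '{
        "training_args": {
            "learning_rate": 2e-5,
            "num_train_epochs": 2,
            "push_to_hub": true,
            "hub_token": "YOUR_HUB_TOKEN"
        }
    }'::JSONB
);
```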
#### 5.3 Monitoring
During training, metrics such as loss and gradient norm are printed as info and also logged in the `pgml.logs` table. Below is a snapshot of such output.
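These logged metrics can also be queried directly (a sketch; it assumes `pgml.logs` stores each record in a JSONB `logs` column with an `id` ordering column, which may differ in your version):

```sql
-- Inspect the most recent training metrics
SELECT logs->>'step'      AS step,
       logs->>'loss'      AS loss,
       logs->>'grad_norm' AS grad_norm
FROM pgml.logs
ORDER BY id DESC
LIMIT 10;
```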
Here is an example `pgml.transform` call for real-time predictions on the newly minted model:
Time: 175.264 ms
```
**Batch predictions**
```sql
By following these steps, you can effectively restart training from a previously trained model, allowing for further refinement and adaptation of the model based on new requirements or insights. Adjust parameters as needed for your specific use case and dataset.
## 8. Hugging Face Hub vs. PostgresML as Model Repository
We utilize the Hugging Face Hub as the primary repository for fine-tuning Large Language Models (LLMs). Leveraging the Hugging Face Hub offers several advantages:
<!-- pgml-apps/pgml-chat/README.md -->
Before you begin, make sure you have the following:
- Python version >=3.8
- (Optional) OpenAI API key
# Getting started
1. Create a virtual environment and install `pgml-chat` using `pip`:
```bash
If you have any further questions or need more information, please feel free to send an email to team@postgresml.org or join the PostgresML Discord community at https://discord.gg/DmyJP3qJ7U.
```
### Slack
**Setup**
Once the Slack app is running, you can interact with the chatbot on Slack as shown below:

### Discord
**Setup**
4. Check the [roadmap](#roadmap) for features that you would like to work on.
5. If you are looking for features that are not included here, please open an issue and we will add it to the roadmap.
# Roadmap
- ~~Use a collection for chat history that can be retrieved and used to generate responses.~~
- Support for file formats like rst, html, pdf, docx, etc.