Data docs #1418
@@ -0,0 +1,4 @@

```
.terraform
*.lock.hcl
*.tfstate
*.tfstate.backup
```
@@ -0,0 +1,7 @@

# Terraform configuration for pgml-rds-proxy on EC2

This is a sample Terraform deployment for running pgml-rds-proxy on EC2. It will spin up an EC2 instance with a public IP and a working security group, and install the community Docker runtime.

Once the instance is running, you can connect to it using the root key and run the pgml-rds-proxy Docker container with the correct PostgresML `DATABASE_URL`.
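As a rough sketch of that last step, run on the instance after SSHing in with the root key. The image name, tag, and flags below are assumptions rather than something this README pins down; the port matches the security group in the Terraform config, and `DATABASE_URL` is a placeholder:

```bash
# Start the proxy and point it at your PostgresML database.
# NOTE: the image name/tag and flags are illustrative assumptions; check the
# pgml-rds-proxy documentation for the published image and its options.
docker run -d \
    --name pgml-rds-proxy \
    -p 6432:6432 \
    -e DATABASE_URL="postgres://user:password@your-postgresml-host:6432/your_database" \
    ghcr.io/postgresml/pgml-rds-proxy:latest
```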
@@ -0,0 +1,84 @@

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.46"
    }
  }

  required_version = ">= 1.2.0"
}

provider "aws" {
  region = "us-west-2"
}

data "aws_ami" "ubuntu" {
  most_recent = true

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }

  owners = ["099720109477"] # Canonical
}

resource "aws_security_group" "pgml-rds-proxy" {
  egress {
    from_port        = 0
    to_port          = 0
    protocol         = "-1"
    cidr_blocks      = ["0.0.0.0/0"]
    ipv6_cidr_blocks = ["::/0"]
  }

  ingress {
    from_port        = 6432
    to_port          = 6432
    protocol         = "tcp"
    cidr_blocks      = ["0.0.0.0/0"]
    ipv6_cidr_blocks = ["::/0"]
  }

  ingress {
    from_port        = 22
    to_port          = 22
    protocol         = "tcp"
    cidr_blocks      = ["0.0.0.0/0"]
    ipv6_cidr_blocks = ["::/0"]
  }
}

resource "aws_instance" "pgml-rds-proxy" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t3.micro"
  key_name      = var.root_key

  root_block_device {
    volume_size           = 30
    delete_on_termination = true
  }

  vpc_security_group_ids = [
    aws_security_group.pgml-rds-proxy.id,
  ]

  associate_public_ip_address = true
  user_data                   = file("${path.module}/user_data.sh")
  user_data_replace_on_change = false

  tags = {
    Name = "pgml-rds-proxy"
  }
}

variable "root_key" {
  type        = string
  description = "The name of the SSH Root Key you'd like to assign to this EC2 instance. Make sure it's a key you have access to."
}
```
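To create the instance, the standard Terraform workflow applies. The `root_key` variable must name an EC2 key pair that already exists in your AWS account; the key pair name below is only an example:

```bash
# Download the AWS provider, then create the security group and EC2 instance.
terraform init
terraform apply -var "root_key=my-existing-key-pair"
```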
@@ -0,0 +1,21 @@

```bash
#!/bin/bash
#
# Cloud init script to install Docker on an EC2 instance running Ubuntu 22.04.
#

sudo apt-get update
sudo apt-get install -y ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc

# Add the repository to Apt sources:
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update

sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo groupadd docker
sudo usermod -aG docker ubuntu
```
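Once cloud-init has finished, a quick sanity check that Docker is usable by the `ubuntu` user; the key path and instance IP are placeholders:

```bash
# The ubuntu user was added to the docker group above, so no sudo is needed.
ssh -i ~/.ssh/your-root-key.pem ubuntu@<instance-public-ip> docker info
```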
@@ -0,0 +1 @@

```
*.md.bak
```
@@ -7,8 +7,9 @@

* [Create your database](introduction/getting-started/create-your-database.md)
* [Connect your app](introduction/getting-started/connect-your-app.md)
* [Import your data](introduction/getting-started/import-your-data/README.md)
* [CSV](introduction/getting-started/import-your-data/csv.md)
* [Foreign Data Wrapper](introduction/getting-started/import-your-data/foreign-data-wrapper.md)
* [Logical replication](introduction/getting-started/import-your-data/logical-replication/README.md)
* [Foreign Data Wrappers](introduction/getting-started/import-your-data/foreign-data-wrappers.md)
* [COPY](introduction/getting-started/import-your-data/copy.md)

## API

@@ -50,7 +51,7 @@

## Product

* [Cloud Database](product/cloud-database/README.md)
* [AI Database](product/cloud-database/README.md)
* [Serverless databases](product/cloud-database/serverless-databases.md)
* [Dedicated](product/cloud-database/dedicated.md)
* [Enterprise](product/cloud-database/plans.md)

@@ -79,7 +80,7 @@

## Resources

* [FAQs](resources/faqs.md)
* [Data Storage & Retrieval](resources/data-storage-and-retrieval/README.md)
* [Data Storage & Retrieval](resources/data-storage-and-retrieval/tabular-data.md)

**Contributor:** This is a dup?

**Author:** Not a dup, README.md is currently empty.

* [Tabular data](resources/data-storage-and-retrieval/tabular-data.md)

**Contributor:** Suggested change

* [Documents](resources/data-storage-and-retrieval/documents.md)
* [Partitioning](resources/data-storage-and-retrieval/partitioning.md)
@@ -4,11 +4,11 @@ description: Setup a database and connect your application to PostgresML

# Getting Started

A PostgresML deployment consists of multiple components working in concert to provide a complete Machine Learning platform. We provide a fully managed solution in our cloud.
A PostgresML deployment consists of multiple components working in concert to provide a complete Machine Learning platform. We provide a fully managed solution in our cloud, and document a self-hosted installation in our docs.

* A PostgreSQL database, with pgml and pgvector extensions installed, including backups, metrics, logs, replicas and high availability configurations
* A PgCat pooling proxy to provide secure access and model load balancing across tens of thousands of clients
* A web application to manage deployed models and host SQL notebooks
* PostgreSQL database, with `pgml`, `pgvector` and many other extensions installed, including backups, metrics, logs, replicas and high availability
* PgCat pooler to provide secure access and model load balancing across thousands of clients
* A web application to manage deployed models and write experiments in SQL notebooks

Suggested change:

* Before: A web application to manage deployed models and write experiments in SQL notebooks
* After: A web application to manage deployed models and share experiments and analysis in SQL notebooks
@@ -4,13 +4,13 @@ description: PostgresML is compatible with all standard PostgreSQL clients

# Connect your app

You can connect to your database from any Postgres compatible client. PostgresML is intended to serve in the traditional role of an application database, along with it's extended role as an MLOps platform to make it easy to build and maintain AI applications.
You can connect to your database from any PostgreSQL-compatible client. PostgresML is intended to serve in the traditional role of an application database, along with its extended role as an MLOps platform to make it easy to build and maintain AI applications together with your application data.

## Application SDKs
## SDK
Suggested change:

* Before: ## SDK
* After: ## Client SDKs

We've waffled between singular and plural. If we want to change to singular, we'll need to update in several other places, including urls.

Not going to write "Client SDKs" ever again; that wording is just wrong. An SDK is only usable on the client by definition.

It's going to be singular going forward.

New PRs should use the existing style guide, for code or copy. We should make that update globally in a separate PR to preserve consistency.

On the separate issue: many users frequently ask about the difference between the SDK and the Extension, and how they interact. I think you're confused about the difference between an SDK and an API as commonly used, but those terms are not always 100% consistently used in industry, so we need some additional education about how we're using them.
Suggested change (outdated):

* Before: We provide a client SDK for JavaScript, Python and Rust. The SDK manages connections to the Postgres database and makes it easy to construct efficient queries for AI use cases, like managing a document collection for RAG, or building a chatbot. All of the ML & AI still happenening inside the database, with centralized operations, hardware and dependency management.
* After: We provide a client SDK for JavaScript, Python and Rust. The SDK manages connections to the Postgres database and makes it easy to construct efficient queries for AI use cases, like managing a document collection for RAG, or building a chatbot. All of the ML & AI still happens inside the database, with centralized operations, hardware and dependency management.
Suggested change (outdated):

* Before: The SDK are under rapid development to add new features and use cases, but we release non breaking changes with minor version updates in accordance with SemVer. It's easy to install into your existing application.
* After: The SDKs are under rapid development to add new features and use cases, but we release non breaking changes with minor version updates in accordance with SemVer. It's easy to install into your existing application.
Suggested change:

* Before: Our SDK comes with zero additional dependencies. The core of the SDK is written in Rust, and we provide language bindings and native packaging & distribution.
* After: Our SDK comes with zero additional dependencies, to provide the simplest and safest ML application deployment and maintenance possible. The core of the SDK is written in Rust, and we provide language bindings and native packaging & distribution.
Suggested change:

* Before: If you need to write ad-hoc queries, you can use any of these popular tools to execute SQL queries directly on your database:
* After: If you need to write ad-hoc queries, or perform administrative functions, you can use any of these popular tools to execute SQL queries directly on your database:
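For example, a minimal ad-hoc session with `psql` might look like this; the connection string is a placeholder, so use the credentials from your PostgresML deployment:

```bash
# Connect and run a one-off query directly against the database.
psql postgres://user:password@sql.cloud.postgresml.org/your_pgml_database \
    -c "SELECT version();"
```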
@@ -1,22 +1,26 @@

# Import your data

Machine learning always depends on input data, whether it's generating text with pretrained LLMs, training a retention model on customer data, or predicting session abandonment in real time. Just like any PostgreSQL database, PostgresML can be configured as the authoritative application data store, a streaming replica from some other primary, or use foreign data wrappers to query another data host on demand. Depending on how frequently your data changes and where your authoritative data resides, different methodologies imply different tradeoffs.
AI needs data, whether it's generating text with LLMs, creating embeddings, or training regression or classification models on customer data.

PostgresML can easily ingest data from your existing data stores.
Just like any PostgreSQL database, PostgresML can be configured as the primary application database, a logical replica of your primary database, or with foreign data wrappers to query your primary database on demand. Depending on how frequently your data changes and your latency requirements, one approach is better than the other.

## Static data
## Primary database

Data that changes infrequently can be easily imported into PostgresML using `COPY`. All you have to do is export your data as a CSV file, create a table in Postgres to store it, and import it using the command line.
If you're intention is to use PostgresML as your primary database, your job here is done. You can use the connection credentials provided and start building your application on top of in-database AI right away.

Suggested change:

* Before: If you're intention is to use PostgresML as your primary database, your job here is done. You can use the connection credentials provided and start building your application on top of in-database AI right away.
* After: If your intention is to use PostgresML as your primary database, your job here is done. You can use the connection credentials provided and start building your application on top of in-database AI right away.
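As a quick illustration of what "in-database AI right away" can look like, here is a sketch assuming the `pgml` extension and its `pgml.embed()` function are available on your deployment; the connection string is a placeholder and the model name is purely an example:

```bash
# Generate an embedding inside the database; no data leaves Postgres.
psql postgres://user:password@sql.cloud.postgresml.org/your_pgml_database \
    -c "SELECT pgml.embed('intfloat/e5-small-v2', 'PostgresML keeps the ML next to your data');"
```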
@@ -0,0 +1,75 @@

# COPY

Data that changes infrequently can be easily imported into PostgresML (and any other Postgres database) using `COPY`. All you have to do is export your data as a file, create a table in Postgres to store it, and import it using the command line (or your IDE of choice).

## Getting started

We'll be using CSV as our data format of choice. CSV is a supported mechanism for data transport in pretty much every database and system in existence, so you won't have any trouble finding the CSV export functionality in your current data store.

Let's use a simple CSV file with 3 columns as an example:

| Column           | Data type | Example data |
|------------------|-----------|--------------|
| name             | text      | John         |
| age              | integer   | 30           |
| is\_paying\_user | boolean   | true         |

### Export data

If you're using a Postgres database already, you can export any table as CSV with just one command:

```bash
psql \
    postgres://user:password@your-production-db.amazonaws.com \
    -c "\copy (SELECT * FROM users) TO '~/users.csv' CSV HEADER"
```

If you're using another data store, it will almost always provide a CSV export functionality.

### Create table in PostgresML

Create a table in PostgresML with the correct schema:

{% tabs %}
{% tab title="SQL" %}

```postgresql
CREATE TABLE users(
    name TEXT,
    age INTEGER,
    is_paying_user BOOLEAN
);
```

{% endtab %}
{% tab title="Output" %}

```
CREATE TABLE
```

{% endtab %}
{% endtabs %}

Data types should roughly match what you have in your CSV file. If a data type is not known, you can always use `TEXT` and figure it out later with a few queries. Postgres also supports converting data types, as long as the values are formatted correctly.
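For example, if `age` arrived as `TEXT`, it can be converted in place once the values are clean. This is a sketch reusing the table and column names from the example above and the same placeholder connection string:

```bash
# Rewrite the column as INTEGER, casting each existing value.
psql \
    postgres://user:password@sql.cloud.postgresml.org/your_pgml_database \
    -c "ALTER TABLE users ALTER COLUMN age TYPE INTEGER USING age::INTEGER"
```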
### Import data

Once you have a table and your data exported as CSV, importing it can also be done with just one command:

```bash
psql \
    postgres://user:password@sql.cloud.postgresml.org/your_pgml_database \
    -c "\copy your_table FROM '~/your_table.csv' CSV HEADER"
```

We took our export command and changed `TO` to `FROM`, and that's it. Make sure you're connecting to your PostgresML database when importing data.

## Refresh data

If your data has changed, repeat this process. To avoid duplicate entries in your table, you can truncate (or delete) all rows beforehand:

```sql
TRUNCATE your_table;
```
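If you want the refresh to be atomic, `psql` can wrap both steps in a single transaction so the truncate and the import either both succeed or both roll back. A sketch using the same placeholder connection string:

```bash
# Truncate and re-import in one transaction; readers never see a half-empty table.
psql \
    postgres://user:password@sql.cloud.postgresml.org/your_pgml_database \
    --single-transaction \
    -c "TRUNCATE your_table" \
    -c "\copy your_table FROM '~/your_table.csv' CSV HEADER"
```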
This file was deleted.

> This is the Cloud, as opposed to Vector, section.