Using PostgreSQL for Data Science

MacOS Installation and Configuration

Login to MacOS as user with Administrative rights
Install: brew install postres
Login to PostgreSQL as a admin user: sudo -u user_name sql db_name
Note: PostgreSQL and MacOS admin users are different and independent.
Creating a db_user: CREATE ROLE user_name WITH LOGIN PASSWORD 'your secret here’;
Changing permissions: ALTER ROLE user_name CREATEDB;
Connecting to a server running as a regular user
pgAdmin > right-click server > Create > Sever
- Main tab: Name it
- Connection tab: Specify IP address (e.g. 127.0.0.1) > admin db-user role

Invoke psql command shell

psql -d <database> -U <user> -W <password>
Note: Add -E to echo queries for learning

Command	What it does
`\l`	list databases
`\c <database>`	connect to a database
`\dt`	list tables
`\dt .`	list selected tables
`\d <table>`	show table definition
`\dv`	list views
`\dn`	list schemas
`\df`	list functions
`\df+ <function>`	show SQL of function
`\h`	psql help
`\? <psql cmd>`	psql command help
`\h <sql>`	sql command help
`\i <script>`	runs the sql script file
`\q`	quit psql
`q`	quit current task
`\x`	better looking output

Troubleshooting

Finding all postgres processes: ps aux | grep postgres
Finding process using a port: lsof -i tcp:5432
Finding process using a port: netstat -vanp tÂcp | grep 5432
Finding all running processes: ps -A

Starting and Stopping the Server on MacOS

start and stop the server: su - <admin_user> #
create aliases to manage the server storing them in ~/.bash_profile
- alias pgq='pg_ctl -D /usr/local/var/postgres stop -s -m fast'
- alias pgs='pg_ctl -D /usr/local/var/postgres -l /usr/local/var/postgres/server.log start'
Use psql to work with PostgreSQL on the command line or the pgAdmin4 GUI application (available as a homebrew cask)

Creating a tables

Using pgAdmin
Database: Right-click on Databases > Create > Database
Double-click on database > Double click on Schemes > Right-click on tables > Create
Columns: Right-click on table > Create > Column > Give it a name > Give it a type
Constraint: Right-click on table > Properties > Constraints tab > Add an constraint > Edit icon > Name it > Select Definition tab > Select column
Using psql command line

CREATE TABLE animals
(
    species character varying (25),
    vertebrate_class character varying(25),
    appearance character varying(25),
    num_legs int4,
    CONSTRAINT animal_pkey PRIMARY KEY (species)
);

Import data

pgadmin: right-click table > Import / Export > Define location, delimiter, header … > OK
Check by showing the data

SELECT *
FROM pets;

Statistics

Limit number of results:

SELECT *
FROM staff
LIMIT 10;

Counting and grouping results

SELECT gender, count(*)
FROM staff
GROUP BY gender;

Finding the max and min

SELECT max(salary), min(salary)
FROM staff;

Apply that across all departments

SELECT department, max(salary), min(salary)
FROM staff
GROUP BY department;

Aggregate and round values

SELECT department, round(sum(salary), 0), round(avg(salary), 0), round(var_pop(salary), 0), round(stddev_pop(salary), 0)
FROM staff
GROUP BY department;

Truncating, ceil and rounding

SELECT department, avg(salary), trunc(avg(salary),2 ), ceil(avg(salary)), round(avg(salary), 2)
FROM staff
GROUP BY department;

For more information see: PostrgreSQL Aggregation Functions Documentation

Classification

Use CASE to show pet names and a column to indicate whether the pet's name is long or short (a long name is strictly more than 6 characters long). Filter to select only female pets.

SELECT name ,
    CASE WHEN length(name) > 6
    THEN 'long'
    ELSE 'short' END
FROM pets
WHERE gender = 'female';

Filtering

Find data that meets a condition

SELECT last_name, department, salary
FROM staff
WHERE salary > 100000;

Adding multiple conditions to the WHERE clause

SELECT last_name, department, salary
FROM staff
WHERE department = 'Tools' AND salary > 100000;

Wild card in where clause

SELECT last_name, department, salary
FROM staff
WHERE department
LIKE 'B%';

Combining filters and aggregates

SELECT department, sum(salary)
FROM staff WHERE department
LIKE 'B%'
GROUP BY department;
-- NOTE:  Slow queries result from putting ‘&’ first as scans every row ignoring the index

Find distinct values

SELECT DISTINCT LOWER(department)
FROM staff;

Munging

Concatenating values

SELECT job_title || '-' || department
FROM staff;

… and create a new column

SELECT job_title || '-' || department title_dept
FROM staff;

Trim and length functions

SELECT length(' software engineer '), length(trim(' software engineer '));

Create new boolean field

SELECT job_title, (job_title LIKE '%Assistant%') is_assist
FROM staff;

Extracting substring of length 3

SELECT SUBSTRING('abcdefghijkl'
FROM 6 FOR 3) test_string;

Extracting remainder of a string

SELECT SUBSTRING('abcdefghijkl' FOR 4) test_string;

Replacing text

SELECT OVERLAY(job_title PLACING 'Asst.'
FROM 1 FOR 9)
FROM staff
WHERE job_title
LIKE 'Assistant%';

Regex

Find assistants at specified levels

SELECT job_title
FROM staff
WHERE job_title
SIMILAR TO '%Assistant%(III|IV)';

… now match levels starting with I

SELECT job_title
FROM staff
HERE job_title
SIMILAR TO '%Assistant% I_';

… or job title that start with E, P or S

SELECT job_title
FROM staff
WHERE job_title
SIMILAR TO '[EPS]%';

Sub-Queries

Consider using a view and or a windowing functions for simpler way to implement the same functionality.

Adding a calculated field

SELECT s1.last_name, s1.salary, s1.department,
    (SELECT round(avg(salary))
    FROM staff s2
    WHERE s2.department = s1.department )
FROM staff s1;

Using a sub-query instead of a table name

SELECT s1.department, round(avg(s1.salary))
FROM
    (SELECT department, salary
    FROM staff
    WHERE salary > 100000) s1
GROUP BY department;

Using a sub-query to find max salary

SELECT s1.department, s1.last_name, s1.salary
FROM staff s1
WHERE s1.salary = (SELECT max(s2.salary)
FROM staff s2);

For more information see: PostgreSQL SubQuery Expressions Documentation

Joining Tables

Join two tables using department name

SELECT s.last_name, s.department, cd.company_division
FROM staff
JOIN company_divisions cd
ON s.department = cd.department;
    -- Note:  it is missing rows because not all departments
    -- in the staff table are found in the company_divisions table,
    -- so we need to prefer the division from the staff table.
    -- This is called a left outer join.

Left outer-join gets all rows

SELECT s.last_name, s.department, cd.company_division
FROM staff s
LEFT JOIN company_divisions cd
ON s.department = cd.department;

… find departments without a division

SELECT s.last_name, s.department, cd.company_division
FROM staff s
LEFT JOIN company_divisions cd
ON s.department = cd.department
WHERE cd.company_division IS NULL;
    -- Note:  Turns out it is the Books department

Using Views

Save a long join by creating a view

CREATE VIEW staff_div_reg AS
    SELECT s.*, cd.company_division, cr.company_regions
    FROM staff s
    LEFT JOIN company_divisions cd
    ON s.department = cd.department
    LEFT JOIN company_regions cr
    ON s.region_id = cr.region_id;

Check that all rows are present in the view

SELECT count(*)
FROM staff_div_reg;

Query to be stored as a view

SELECT company_regions, count(*)
FROM staff_div_reg
GROUP BY company_regions
ORDER BY company_regions;

Use view to provide counts of staff in each region

CREATE OR REPLACE VIEW staff_div_reg_country
AS
    SELECT s.*, cd.company_division, cr.company_regions, cr.country
    FROM staff s, company_divisions cd, company_regions cr;

Create a view that joins user and sales tables

CREATE VIEW sales_by_person
AS
    SELECT s.name, s.id, o.amount
    FROM salesperson s
    LEFT JOIN orders o
    ON s.id = o.salesperson_id;
        -- Note: Every time the view is used
        --       its SQL is re-evaluated.

Create a materialized view that is stored on the server

CREATE MATERIALIZED VIEW sales_by_person
AS
    SELECT s.name, s.id, o.amount
    FROM salesperson s
    LEFT JOIN orders o
    ON s.id = o.salesperson_id;
        -- Note: Once created materialized views
        --       can be queried many times but are
        --       not automatically updated.

Updating a materialized view

REFRESH MATERIALIZED VIEW sales_by_person;

Report number and amount of successful sales by name

SELECT name, count(amount), sum(amount)
FROM sales_by_person
GROUP BY name
HAVING count(amount) > 1
ORDER BY name;

For more information see: * PostgreSQL Views and the Rule System Documentation * PostgreSQL Materialized Views Documentation

Temporary Tables

Temporary tables are alternative to Views and Subqueries. Here are two ways to create a temporary table in SQL.

Create and drop a temporary table

-- create a temporary table
CREATE TABLE temp_table AS
(
    SELECT col1, col2, col3
    FROM another_table;
)

-- select the data you need
SELECT tt.col1, tt.col2
FROM temp_table AS tt;

-- delete the temporary table from the database
DROP TABLE temp_table;

True temporary table created using WITH

-- create the temporary table
WITH temp_table AS
(
    SELECT col1, col2, col3
    FROM another_table
);

-- select the data you need
SELECT tt.col1, tt.col2
FROM temp_table AS tt;

Grouping & Totaling

Breakout sales number and amount of sales by name

SELECT s.name, count(o.amount), sum(o.amount)
FROM salesperson s
LEFT JOIN orders o
ON s.id = o.salesperson_id
GROUP BY s.name
HAVING count(o.amount) > 1
ORDER BY s.name;

Breakout the total number of employees by division and region using grouping set sql SELECT company_division, company_regions, count(*) FROM staff_div_reg GROUP BY GROUPING SETS (company_division, company_regions) ORDER BY company_regions, company_division; -- Note: Cells are blank when sub-totaled by another quantity

Create subtotals using rollup

SELECT company_regions, country, count(*)
FROM staff_div_reg_country
GROUP BY ROLLUP(country, company_regions)
ORDER BY country, company_regions;

Create all possible subtotals using cube

SELECT company_division, company_regions, count(*)
FROM staff_div_reg_country
GROUP BY CUBE(company_division, company_regions);

For more information see: PostgreSQL Grouping, Cube and Rollup Documentation

Sorting / Ordering

Find the top N values

SELECT last_name, job_title, salary
FROM staff
ORDER BY salary DESC
FETCH FIRST 10 ROWS ONLY;
    -- Note:  LIMIT get the first N elements, FETCH scans all
    --        the rows and get the highest N values.

Ascending ordering by count

SELECT company_division, count(*)
FROM staff_div_reg_country
GROUP BY company_division
ORDER BY count(*);

Descending order by count

SELECT company_division, count(*)
FROM staff_div_reg_country
GROUP BY company_division
ORDER BY count(*) DESC;

Limit to first 3 rows

SELECT company_division, count(*)
FROM staff_div_reg_country GROUP BY company_division
ORDER BY count(*) DESC
FETCH FIRST 3 ROWS ONLY;

Sorting by date

SELECT facility_id, bed_census_date
FROM beds
WHERE facility_id = '6057'
ORDER BY bed_census_date::timestamp ASC;

Finding the largest totals by sorting an aggregated field

SELECT facility_name, facility_id, bed_census_date, sum(available_residential_beds::integer)
FROM beds
GROUP BY facility_name, facility_id, bed_census_date
ORDER BY sum(available_residential_beds::integer) DESC
FETCH FIRST 10 ROWS ONLY;

For more information see: PostgreSQL Sorting Documentation

Window Functions

Window functions are simpler than subqueries and produce similar results they operate rows adjacent to the current row.

SELECT company_regions, last_name, salary, min(salary)
OVER
    (
        PARTITION BY company_regions
    )
FROM staff_div_reg;

Employees by department sorted by salary

SELECT department, last_name, salary, first_value(salary)
OVER
    (
        PARTITION BY department
        ORDER BY salary DESC
    )
FROM staff;

Employees by department sorted by last_name

SELECT department, last_name, salary, first_value(salary)
OVER
    (
        PARTITION BY department
        ORDER BY last_name
    )
FROM staff;

Ranking employees within a department by salary

SELECT department, last_name, salary, rank()
OVER
(
    PARTITION BY department
    ORDER BY salary DESC
)
FROM staff;

Lag function references previous row

SELECT department, last_name, salary, lag(salary)
OVER
    (
        PARTITION BY department
        ORDER BY salary DESC
    )
FROM staff;

Lead function references the next row

SELECT department, last_name, salary, lead(salary)
OVER
(
    PARTITION BY department
    ORDER BY salary DESC
)
FROM staff;

Calculating salary quartiles using ntile

SELECT department, last_name, salary, ntile(4)
OVER
(
    PARTITION BY department
    ORDER BY salary DESC
)
FROM staff;

For more information see: PostgreSQL Windowing Function Expressions Documentation

Telling Stories with Data

Start with a problem
- Losing customers
- Product sales decreasing
Gather data it will take time, figure in more time when using different data sources to make them all consistently encoded
Understand your data using statistics
Reformat and check it before attempting to do joins
Use views to capture complex SQL statements
Use joins
- Inner joins only return row that both table share
- Use an outer join if you want all rows from one table even if they are missing from the other table
For cross tabulations use cubes, rollups, and grouping sets rather than subqueries
Use window functions to work on sets of related rows

Resources

Tutorial: https://www.tutorialspoint.com/postgresql/index.htm
Tutorial: [http://www.postgresqltutorial.com/(http://www.postgresqltutorial.com/)]
PostrgreSQL Aggregation Functions: https://www.postgresql.org/docs/current/static/functions-aggregate.html
Joins explained visually: https://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins
SQL for Data Science from Lynda.com: https://www.lynda.com/SQL-tutorials/SQL-Tips-Tricks-Data-Science/558576-2.html. I highly recommend this course. Increasingly Lynda.com's excellent courses are available through your local library. Library patrons in Denver can access this course through Denver Public Library Teach Yourself Technology page
SQL for Data Science from Lynda.com: https://www.lynda.com/SQL-tutorials/SQL-Tips-Tricks-Data-Science/558576-2.html. This is an excellent course and increasingly Lynda's courses are available through your local library. Here in Denver library patrons can access all of Lynda's courses through the Denver Public Library Teach Yourself Technology page.