Skip to content

Latest commit

 

History

History
165 lines (124 loc) · 6.5 KB

README.md

File metadata and controls

165 lines (124 loc) · 6.5 KB

Summarizing large tables with SQL

Henry Luan

Overview

The core helper function get_summary_codes in the sql.mechanic package takes in a string that specifies a table’s database name, schema name, and table name. It outputs an SQL query that summarizes the table’s column statistics.

If you don’t have much time to read or just want a quick solution, jump to Example 2.

Intended User

  • If you work with tables inside databases hosted on powerful servers (usually on-premise) but have limited compute resources (e.g. RAM, GPU, CPU) for advanced BI (e.g. Python, R, SAS, etc.)

  • If it’s much cheaper for you to use database servers than analytics servers (e.g. those for Python, R, etc.) either on-premise or on the cloud.

Limitations

  • You probably want to understand how Example 2affect the CPU and disk usage of your database server, to avoid bad surprises on your server’s resource usage.

  • If you use the setup of Example 2on a cloud database, you MUST do some testing, to understand how the example affects the CPU and disk usage of your cloud resource. Please avoid potentially expensive mistakes.

  • Currently, the function only works with Microsoft SQL Server and Netezza databases. Feel free to contribute to the codes, if interested.

Credits

Specail credits: the R codes in this package are built on top of Gordon S. Linoff’s book Data Analysis Using SQL and Excel, 2nd Edition. His work has been a tremendous inspiration for the creation of this package.

Example 1: Generate SQL queries and execute them in DMBS

Step 1: Install packages

You can install the library from my GitHub. If you have concerns regarding the package’s security, you can download, check, and use the “get_summary_codes.R” file directly.

# Install package
library(devtools)
install_github("casualcomputer/sql.mechanic",quiet=TRUE)

# Alternative: use the 'get_summary_codes.R' file only
  # source("get_summary_codes.R")

Step 2: Generate SQL quires

The following codes 1) generate the SQL queries you need to summarize a table, and 2) copy (Ctrl+C) the codes to your clipboard. All you have to do is paste it to your SQL editor and execute the queries.

library(sql.mechanic)

#SQL codes for basic summary, Netezza database 
    sql_basic_netezza = get_summary_codes("DB_NAME.SCHEMA_NAME.TABLE_NAME", type="basic", dbtype="Netezza")   
    
#SQL codes for advanced summary, Netezza database 
    sql_advanced_netezza = get_summary_codes("DB_NAME.SCHEMA_NAME.TABLE_NAME", type="advanced", dbtype="Netezza")   
    
#SQL codes for basic summary, Microsoft SQL Server 
    sql_basic_mssql = get_summary_codes("DB_NAME.SCHEMA_NAME.TABLE_NAME", type="basic", dbtype="MSSQL")  

#SQL codes for advanced summary, Microsoft SQL Server
    sql_advanced_mssql = get_summary_codes("DB_NAME.SCHEMA_NAME.TABLE_NAME", type="advanced", dbtype="MSSQL")  
    
#copy some of the sql queries to the clipboard    
    writeClipboard(sql_basic_netezza) 

Step 3: Paste the codes in your clipboard and run it in SQL

In case you are curious, the SQL copied to your clipboard looks like this.

SELECT REPLACE(REPLACE(REPLACE('<start> SELECT ''<col>'' as colname,
                               COUNT(*) as numvalues,
                               MAX(freqnull) as freqnull,
                               CAST(MIN(minval) as CHAR(100)) as minval,
                               SUM(CASE WHEN <col> = minval THEN freq ELSE 0 END) as numminvals,
                               CAST(MAX(maxval) as CHAR(100)) as maxval,
                               SUM(CASE WHEN <col> = maxval THEN freq ELSE 0 END) as nummaxvals,
                               SUM(CASE WHEN freq =1 THEN 1 ELSE 0 END) as numuniques

                               FROM (SELECT <col>, COUNT(*) as freq
                               FROM SCHEMA_NAME.<tab> GROUP BY <col>) osum
                                                   CROSS JOIN (SELECT MIN(<col>) as minval, MAX(<col>) as maxval, SUM(CASE WHEN <col> IS NULL THEN 1 ELSE 0 END) as freqnull
                                                   FROM (SELECT <col> FROM SCHEMA_NAME.<tab>) osum
                                                   ) summary',
                               '<col>', column_name),
                               '<tab>', 'TABLE_NAME'),
                               '<start>',
                               (CASE WHEN ordinal_position = 1 THEN ''
                               ELSE 'UNION ALL' END)) as codes_data_summary
                               FROM (SELECT table_name, case when regexp_like(column_name,'[a-z.]|GROUP')  then '"'||column_name||'"'
                                                             else column_name end as column_name  , ordinal_position
                               FROM information_schema.columns
                               WHERE table_name ='TABLE_NAME') a;

Step 4: Copy, paste and execute the query results from Step 3.

Example 2: Automatically summarize tables in your databases

This example shows you how you can summarize tables as you did in Example 1, with only a few lines of R codes.

# Install package
library(devtools)
install_github("casualcomputer/sql.mechanic",quiet=TRUE)

# Alternative: use the 'get_summary_codes.R' file only
  # source("get_summary_codes.R")

# Load packages  
library(sql.mechanic)
library(odbc)
library(DBI)

# Connect to database(s)
## Method 1: prompting user (you need to make some changes here)
  con <- dbConnect(odbc(),
                   Driver = "SQL Server",
                   Server = "mysqlhost",
                   Database = "mydbname",
                   UID = "myuser",
                   PWD = "Database password",
                   Port = 1433, encoding = 'windows-1252') #'windows-1252' allows French to display properly

## Alternative Method: Using a DSN
  #con <- dbConnect(odbc::odbc(), "DNS_NAMES", encoding = 'windows-1252')

sql_query = get_summary_codes("DB_NAME.SCHEMA_NAME.TABLE_NAME", type="basic", dbtype="Netezza") 

res = dbSendQuery(con, sql_query) # part of Step 3 in "Example 1"
sql_query_mod = dbFetch(res) # part of Step 3 in "Example 1"

res = dbSendQuery(con, sql_query_mod) # part of Step 4 in "Example 1"
output_table = dbFetch(res) # part of Step 4 in "Example 1"
print(output) #desired summary table

dbDisconnect(con) #close database connection