Thursday, July 11, 2019

PolyBase Configuration in SQL Server 2016

Below are the steps to configure a PolyBase environment in SQL Server 2016, which lets you read files stored in Azure Blob Storage with SQL queries.

STEP 1: CHECK for existing configuration

EXEC sp_configure; 

STEP 2: CHECK the current 'hadoop connectivity' setting (a value of 0 means it has not been configured yet)

EXEC sp_configure @configname='hadoop connectivity';

STEP 3: Set 'hadoop connectivity' to 7, which enables external tables referencing Hortonworks 2.1, 2.2, and 2.3 on Linux, and Azure blob storage

sp_configure @configname = 'hadoop connectivity', @configvalue = 7;  
GO  
  
RECONFIGURE  
GO
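
After RECONFIGURE, and once the services have been restarted in step 4 below, it is worth re-checking the setting; the run_value column in the sp_configure output should now show 7:

EXEC sp_configure @configname = 'hadoop connectivity';
-- run_value = 7 confirms the option is active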

STEP 4: Install Java, add the JAVA_HOME environment variable, and restart the SQL Server and PolyBase services so the new connectivity setting takes effect

CONNECT TO SQL SERVER AND SET UP THE DATABASE OBJECTS

STEP 1: Create a master key on the database. This is required to encrypt the credential secret.

CREATE MASTER KEY ENCRYPTION BY PASSWORD = '**********';  

STEP 2: Create a database scoped credential for Azure blob storage.

-- IDENTITY: any string (this is not used for authentication to Azure storage).  
-- SECRET: your Azure storage account key. 

CREATE DATABASE SCOPED CREDENTIAL GeAzureBlobStorageCredential
WITH IDENTITY = 'mypolybase', SECRET = '******************';

STEP 3: Create an external data source with CREATE EXTERNAL DATA SOURCE.

-- LOCATION: Azure storage account name and blob container name.
-- CREDENTIAL: The database scoped credential created above.  
CREATE EXTERNAL DATA SOURCE MyAzureStorage WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://polybase@getestdatawarehouse.blob.core.windows.net',
    CREDENTIAL = GeAzureBlobStorageCredential
);

STEP 4: Create an external file format with CREATE EXTERNAL FILE FORMAT.

-- FORMAT_TYPE: type of file format in Hadoop (DELIMITEDTEXT, RCFILE, ORC, PARQUET).
  
CREATE EXTERNAL FILE FORMAT TextFileFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = '|',
        STRING_DELIMITER = '',
        DATE_FORMAT = 'yyyy-MM-dd HH:mm:ss.fff',
        USE_TYPE_DEFAULT = FALSE
    )
);
GO

STEP 5: Create an external table.

CREATE EXTERNAL TABLE asb.my_Table (
    [UserID]       [varchar](50) NULL,
    [EndDate]      [varchar](50) NULL,
    [StartDate]    [varchar](50) NULL,
    [ModifiedDate] [varchar](50) NULL
)
WITH (
    LOCATION = '/my_blob_folder/',
    DATA_SOURCE = MyAzureStorage,
    FILE_FORMAT = TextFileFormat,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 0
);
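
Once the external table exists, the delimited files in blob storage can be queried like any local table. A quick sanity check against the example table above (the asb schema is assumed to exist already):

SELECT TOP 10 [UserID], [StartDate], [EndDate], [ModifiedDate]
FROM asb.my_Table;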

Wednesday, October 10, 2018

BigQuery - Querying a Date-Sharded Table Using Legacy and Standard SQL

FOR DATE SHARDED TABLE

#legacySQL
SELECT *
FROM TABLE_DATE_RANGE([Project:Dataset.Table_], DATE_ADD(CURRENT_TIMESTAMP(), -3, 'DAY'),
CURRENT_TIMESTAMP())
LIMIT 1000


#standardSQL
SELECT *
FROM `Project.Dataset.Table_*`
WHERE _TABLE_SUFFIX BETWEEN FORMAT_DATE("%Y%m%d", DATE_ADD(CURRENT_DATE(), INTERVAL -3 DAY))
AND FORMAT_DATE("%Y%m%d", CURRENT_DATE())
LIMIT 1000
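
If a fixed window is needed instead of a rolling one, _TABLE_SUFFIX can also be compared against literal values (the dates below are placeholders):

#standardSQL
SELECT *
FROM `Project.Dataset.Table_*`
WHERE _TABLE_SUFFIX BETWEEN '20180901' AND '20180907'
LIMIT 1000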

BigQuery - Querying a Day-Partitioned Table Using Legacy and Standard SQL

FOR PARTITIONED TABLE

#legacySQL
SELECT *
FROM [Project:Dataset.Table]
WHERE _PARTITIONTIME BETWEEN DATE_ADD(CURRENT_TIMESTAMP(), -3, 'DAY') AND CURRENT_TIMESTAMP()
LIMIT 1000


#standardSQL
SELECT *
FROM `Project.Dataset.Table`
WHERE _PARTITIONTIME BETWEEN TIMESTAMP(DATE_ADD(CURRENT_DATE(), INTERVAL -3 DAY)) AND CURRENT_TIMESTAMP()
LIMIT 1000
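
Ingestion-time partitioned tables also expose the _PARTITIONDATE pseudo column (the DATE equivalent of _PARTITIONTIME), which makes fixed calendar ranges easier to write; the dates below are placeholders:

#standardSQL
SELECT *
FROM `Project.Dataset.Table`
WHERE _PARTITIONDATE BETWEEN DATE '2018-09-01' AND DATE '2018-09-07'
LIMIT 1000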



Saturday, September 29, 2018

BigQuery - SQL for Flattening Custom Metric Values

Google Analytics streams data into BigQuery in a nested JSON format, which can make it difficult to flatten the custom metric values attached to each hit. This can be overcome with the temp function below (Standard SQL only): pass the customMetrics index you want, together with the customMetrics array, as parameters.

CREATE TEMP FUNCTION
  customMetricByIndex(indx INT64,
    arr ARRAY<STRUCT<index INT64,
    value INT64>>) AS ( (
    SELECT
      x.value
    FROM
      UNNEST(arr) x
    WHERE
      indx=x.index) );

SELECT
  visitStartTime, visitId, visitNumber,
  hit.hitNumber AS session_hit_count,
  hit.type AS hit_type,
  hit.page.hostname AS url_domain_name,
  hit.page,
  customMetricByIndex(3, hit.customMetrics) AS custom_metrics_1
FROM
  `project.dataset.ga_sessions_20180909`,
  UNNEST(hits) AS hit
    

BigQuery - SQL for Flattening Custom Dimension Values

Google Analytics streams data into BigQuery in a nested JSON format, which can make it difficult to flatten the custom dimension values attached to each hit. This can be overcome with the temp function below (Standard SQL only): pass the customDimensions index you want, together with the customDimensions array, as parameters.

CREATE TEMP FUNCTION
  customDimensionByIndex(indx INT64,
    arr ARRAY<STRUCT<index INT64,
    value STRING>>) AS ( (
    SELECT
      x.value
    FROM
      UNNEST(arr) x
    WHERE
      indx=x.index) );

SELECT
  visitStartTime, visitId, visitNumber,
  hit.hitNumber AS session_hit_count,
  hit.type AS hit_type,
  hit.page.hostname AS url_domain_name,
  hit.page,
  customDimensionByIndex(165, hit.customDimensions) AS custom_variable_1
FROM
  `project.dataset.ga_sessions_20180909`,
  UNNEST(hits) AS hit

BigQuery Views for Google Analytics Realtime Sessions - Standard SQL


People who have started using Google Analytics real-time streaming into BigQuery may come across a query conflict when calling the ga_realtime_sessions table with a date-range filter condition.

For example, when we execute the query below

SELECT * FROM 
TABLE_DATE_RANGE([project:dataset.ga_realtime_sessions_], CURRENT_TIMESTAMP(),CURRENT_TIMESTAMP()) 
LIMIT 1000

we end up with the error message:

Query Failed
Error: Cannot output multiple independently repeated fields at the same time.

The reason is that the real-time table and its companion view share the same naming pattern:

Realtime Table: project:dataset.ga_realtime_sessions_20180929
Realtime View: project:dataset.ga_realtime_sessions_view_20180929

In addition, the real-time view is defined in Legacy SQL, so it cannot be referenced from Standard SQL queries. To overcome this, it is good to save the query below as a view that returns today's realtime data.

SELECT * FROM
  `project.dataset.ga_realtime_sessions_2*`
WHERE
  CONCAT('2', CAST(_TABLE_SUFFIX AS string)) = FORMAT_DATE("%Y%m%d", CURRENT_DATE())
  AND exportKey IN (
  SELECT
    exportKey
  FROM (
    SELECT
      exportKey,
      exportTimeUsec,
      MAX(exportTimeUsec) OVER (PARTITION BY visitKey) AS maxexportTimeUsec
    FROM
      `project.dataset.ga_realtime_sessions_2*`
    WHERE
      CONCAT('2', CAST(_TABLE_SUFFIX AS string)) = FORMAT_DATE("%Y%m%d", CURRENT_DATE()))
  WHERE 
    exportTimeUsec >= maxexportTimeUsec)
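
To save it, the same query can be wrapped in Standard SQL DDL (a sketch; the view name ga_realtime_sessions_today is an assumption):

CREATE OR REPLACE VIEW `project.dataset.ga_realtime_sessions_today` AS
SELECT * FROM
  `project.dataset.ga_realtime_sessions_2*`
WHERE
  CONCAT('2', CAST(_TABLE_SUFFIX AS string)) = FORMAT_DATE("%Y%m%d", CURRENT_DATE())
  AND exportKey IN (
  SELECT
    exportKey
  FROM (
    SELECT
      exportKey,
      exportTimeUsec,
      MAX(exportTimeUsec) OVER (PARTITION BY visitKey) AS maxexportTimeUsec
    FROM
      `project.dataset.ga_realtime_sessions_2*`
    WHERE
      CONCAT('2', CAST(_TABLE_SUFFIX AS string)) = FORMAT_DATE("%Y%m%d", CURRENT_DATE()))
  WHERE
    exportTimeUsec >= maxexportTimeUsec)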

Monday, August 13, 2018

There was a problem confirming the ssl certificate : [SSL: TLSV1_ALERT_PROTOCOL_VERSION] tlsv1 alert protocol version (_ssl.c:645)

PIP Issue while installing psycopg2:

I created a virtual environment and tried to install psycopg2, but the install failed with the following error message:

There was a problem confirming the ssl certificate
: [SSL: TLSV1_ALERT_PROTOCOL_VERSION] tlsv1 alert protocol version (_ssl.c:645)

To overcome this issue, follow the steps below:

1. Check which Python you are running and which OpenSSL version it is linked against
 

python -c "import ssl; print(ssl.OPENSSL_VERSION)"
OpenSSL 1.0.2f 28 Jan 2016


python3 -c "import ssl; print(ssl.OPENSSL_VERSION)"
OpenSSL 0.9.8zh 14 Jan 2016

Here python3 is linked against OpenSSL 0.9.8, which supports only TLS 1.0; PyPI now requires TLS 1.2 or later, hence the protocol version error.



2. Check the pip version

pip --version
pip 9.0.1 from /Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages (python 3.5)

3. Upgrade pip for python3
 

curl https://bootstrap.pypa.io/get-pip.py | python3
pip --version
pip 10.0.1 from /Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pip (python 3.5)

which pip
/Library/Frameworks/Python.framework/Versions/3.5/bin/pip


This solved the issue, and psycopg2 installed successfully.