
Wednesday, October 10, 2018

Bigquery - Querying Date Sharded Table Using Legacy and Standard SQL

FOR DATE SHARDED TABLE

#legacySQL
SELECT *
FROM TABLE_DATE_RANGE([Project:Dataset.Table_], DATE_ADD(CURRENT_TIMESTAMP(), -3, 'DAY'),
CURRENT_TIMESTAMP())
LIMIT 1000


#standardSQL
SELECT *
FROM `Project.Dataset.Table_*`
WHERE _TABLE_SUFFIX BETWEEN FORMAT_DATE("%Y%m%d", DATE_ADD(CURRENT_DATE(), INTERVAL -3 day))
AND FORMAT_DATE("%Y%m%d", CURRENT_DATE())
LIMIT 1000

Bigquery - Querying Day Partitioned Table Using Legacy and Standard SQL

FOR PARTITIONED TABLE

#legacySQL
SELECT *
FROM [Project:Dataset.Table]
WHERE _PARTITIONTIME BETWEEN DATE_ADD(CURRENT_TIMESTAMP(), -3, 'DAY') AND CURRENT_TIMESTAMP()
LIMIT 1000


#standardSQL
SELECT *
FROM `Project.Dataset.Table`
WHERE _PARTITIONTIME BETWEEN TIMESTAMP(DATE_ADD(CURRENT_DATE(), INTERVAL -3 DAY)) AND CURRENT_TIMESTAMP()
LIMIT 1000



Saturday, September 29, 2018

Bigquery - SQL for Flattening Custom Metrics Value

Google Analytics streams data into BigQuery in a nested JSON format, which can sometimes make it difficult for users to flatten custom metrics data for each event. This can be overcome by using the custom metric temp function below (Standard SQL only). We can pass customMetrics.index and customMetrics.value as parameters to the temp function.

CREATE TEMP FUNCTION
  customMetricByIndex(indx INT64,
    arr ARRAY<STRUCT<index INT64,
    value INT64>>) AS ( (
    SELECT
      x.value
    FROM
      UNNEST(arr) x
    WHERE
      indx=x.index) );

SELECT
  visitStartTime,
  visitId,
  visitNumber,
  hit.hitNumber AS session_hit_count,
  hit.type AS hit_type,
  hit.page.hostname AS url_domain_name,
  hit.page,
  customMetricByIndex(3, hit.customMetrics) AS custom_metrics_1
FROM
  `project.dataset.ga_sessions_20180909`,
  UNNEST(hits) AS hit
    

Bigquery - SQL for Flattening Custom Dimensions Value

Google Analytics streams data into BigQuery in a nested JSON format, which can sometimes make it difficult for users to flatten custom dimension data for each event. This can be overcome by using the custom dimension temp function below (Standard SQL only). We can pass customDimensions.index and customDimensions.value as parameters to the temp function.

CREATE TEMP FUNCTION
  customDimensionByIndex(indx INT64,
    arr ARRAY<STRUCT<index INT64,
    value STRING>>) AS ( (
    SELECT
      x.value
    FROM
      UNNEST(arr) x
    WHERE
      indx=x.index) );

SELECT
  visitStartTime,
  visitId,
  visitNumber,
  hit.hitNumber AS session_hit_count,
  hit.type AS hit_type,
  hit.page.hostname AS url_domain_name,
  hit.page,
  customDimensionByIndex(165, hit.customDimensions) AS custom_variable_1
FROM
  `project.dataset.ga_sessions_20180909`,
  UNNEST(hits) AS hit

Bigquery Views for Google Analytics Realtime Session - Standard SQL


People who have started using Google Analytics real-time streaming into BigQuery may come across a query conflict when calling the ga_realtime_sessions table with a date range filter condition.

For example, when we execute the query below:

SELECT * FROM 
TABLE_DATE_RANGE([project:dataset.ga_realtime_sessions_], CURRENT_TIMESTAMP(),CURRENT_TIMESTAMP()) 
LIMIT 1000

We end up with the error message:
Query Failed
Error: Cannot output multiple independently repeated fields at the same time.
The reason is that both the real-time table and its view have the same naming pattern:

Realtime Table: project:dataset.ga_realtime_sessions_20180929
Realtime View: project:dataset.ga_realtime_sessions_view_20180929

In addition, the real-time view is defined in Legacy SQL, so we cannot use it in Standard SQL queries. To overcome this, it is good to save the query below as a view that returns today's real-time data.

SELECT  * FROM
  `project.dataset.ga_realtime_sessions_2*`
WHERE
  CONCAT('2', CAST(_TABLE_SUFFIX AS string)) = FORMAT_DATE("%Y%m%d", CURRENT_DATE())
  AND exportKey IN (
  SELECT
    exportKey
  FROM (
    SELECT
      exportKey,
      exportTimeUsec,
      MAX(exportTimeUsec) OVER (PARTITION BY visitKey) AS maxexportTimeUsec
    FROM
      `project.dataset.ga_realtime_sessions_2*`
    WHERE
      CONCAT('2', CAST(_TABLE_SUFFIX AS string)) = FORMAT_DATE("%Y%m%d", CURRENT_DATE()))
  WHERE 
    exportTimeUsec >= maxexportTimeUsec)
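
One way to save the Standard SQL query above as a view is with the bq command line tool. This is only a minimal sketch; the view name below is a hypothetical placeholder, and the view body is the Standard SQL query above.

bq mk --use_legacy_sql=false \
  --view '<Standard SQL query above>' \
  dataset.ga_realtime_sessions_today_view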

Monday, August 13, 2018

There was a problem confirming the ssl certificate : [SSL: TLSV1_ALERT_PROTOCOL_VERSION] tlsv1 alert protocol version (_ssl.c:645)

PIP Issue while installing psycopg2:

I created a virtual environment and tried to install psycopg2, but ended up with the following error message:

There was a problem confirming the ssl certificate
: [SSL: TLSV1_ALERT_PROTOCOL_VERSION] tlsv1 alert protocol version (_ssl.c:645)

To overcome this issue, we need to follow the steps below:

1. Check which Python and OpenSSL versions are in use:
 

python -c "import ssl; print(ssl.OPENSSL_VERSION)"
OpenSSL 1.0.2f 28 Jan 2016


python3 -c "import ssl; print (ssl.OPENSSL_VERSION)"
OpenSSL 0.9.8zh 14 Jan 2016



2. Check pip
pip --version
pip 9.0.1 from /Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages (python 3.5)

3. Upgrade pip for python3
 

curl https://bootstrap.pypa.io/get-pip.py | python3
pip --version
pip 10.0.1 from /Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pip (python 3.5)

which pip
/Library/Frameworks/Python.framework/Versions/3.5/bin/pip
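
4. With pip upgraded, retry the install that failed earlier:

pip install psycopg2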


This solved the issue installing psycopg2.

Sunday, April 29, 2018

SQOOP Import and Export Examples


Below are some sample commands for the Sqoop export and import operators, used to move data between relational databases (MySQL is used here) and an HDFS location.

Export:
sqoop export --connect jdbc:mysql://mysqldb.****.****/database --table <table_name> --username ******* --password ****** --fields-terminated-by ',' -m 1 --export-dir <HDFS Path>

Import:
sqoop import --connect jdbc:mysql://mysqldb.******.****/MyDB --table customers --username ****** --password ****** --target-dir batch/sqoop/job1 -m 1

=> -m 1 loads all data into a single part file.
=> -m 5 loads data into 5 separate part files.

Extract specific columns:
sqoop import --connect jdbc:mysql://mysqldb.edu.cloudlab.com/retail_db --table customers --username labuser --password edureka --target-dir batch/sqoop/job1 --columns "column1,column2" -m 1


Saturday, March 24, 2018

Build Docker for Logstash - Ubuntu

The steps below help to launch a Logstash container from a Dockerfile.

Step 1: Install Docker

>> sudo apt-get install docker-ce (for ubuntu)

Step 2: Create a folder called docker-image

Step 3: Create a file called Dockerfile with the content below:

FROM docker.elastic.co/logstash/logstash:6.2.2
RUN rm -f /usr/share/logstash/pipeline/logstash.conf # (optional)
RUN mkdir -p /usr/share/logstash/template # (optional)
COPY your_pipeline.conf /usr/share/logstash/pipeline/your_pipeline.conf
CMD ["/usr/share/logstash/bin/logstash", "-f", "/usr/share/logstash/pipeline/your_pipeline.conf"]

Step 4: Navigate to the Dockerfile location and run the command below in a terminal:

>> docker build -t test_logstash:v1 <docker dir>

Step 5: Run >> docker run test_logstash:v1 (or use the image id)

Merge Pandas Dataframes and Remove NaN Records


The method below helps developers merge multiple dataframes with the same columns into a single dataframe.

Assume we have dataframes r1 and r2, and we need to ignore null records; then we can use the command below with dropna().

import pandas as pd

merged_df = pd.concat([r1, r2], axis=0).dropna()
merged_df.to_csv('output.csv', index=False, doublequote=False)

Split Strings in Bigquery Using REGEXP

Assume that we have a BigQuery column with values like the ones below:

---------------------------------------------------------
pair
----------------------------------------------------------
television:100
mobile:250
driver: 110
----------------------------------------------------------

Expected Output
---------------------------------------------------------
Device        | Cost
---------------------------------------------------------
television    | 100
mobile        | 250
driver        | 110
---------------------------------------------------------

Use the BigQuery expressions below (Legacy SQL) to split the column:

CASE
  WHEN REGEXP_MATCH(pair, ":") THEN REGEXP_EXTRACT(pair, r'(\w*):')
  ELSE pair
END AS attribute_name,
REGEXP_EXTRACT(pair, r':(.*)') AS attribute_value
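
In Standard SQL, REGEXP_MATCH is not available, so REGEXP_CONTAINS can be used instead. Below is a minimal, self-contained sketch; the WITH clause only supplies the sample values from the table above so the query can be run as-is.

#standardSQL
WITH sample AS (
  SELECT pair
  FROM UNNEST(['television:100', 'mobile:250', 'driver: 110']) AS pair
)
SELECT
  IF(REGEXP_CONTAINS(pair, r':'), REGEXP_EXTRACT(pair, r'(\w*):'), pair) AS device,
  TRIM(REGEXP_EXTRACT(pair, r':(.*)')) AS cost
FROM sample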

Python Fundamentals - Operators

# Save the code below as a Python file and execute it to see the output.

varA = 15
varB = 6

# 1. Addition operator
add_sample = varA + varB
print(add_sample)

# 2. Subtract operator
sub_sample = varA - varB
print(sub_sample)

# 3. Multiply operator
multiply_sample = varA * varB
print(multiply_sample)

# 4. Division operator
division_sample = varA / varB
print(division_sample)

# 5. Add assignment (useful in loop statements; either of the methods below can be used)

add_sample += 3
print(add_sample)

add_sample = add_sample + 1
print(add_sample)

# Similarly for other operators, put the operator sign before the equals sign to assign the value.
# examples:  -=, *=, /=

# 6. Modulus

mod_sample = varA % varB
print(mod_sample)


# 7. Exponentiation

exp_sample = varA ** 2
print(exp_sample)

# Note: operator precedence rule
# BODMAS: Brackets, Orders, Division, Multiplication, Addition, Subtraction

Python Fundamentals - Strings and Indexes Example

# Save the code below as a Python file and execute it to see the output.

# 1. Normal

strA = 'My First string in quotes'
strB = "My first string in double quotes"

print (strA + "; " + strB)

#2. Escape Sequence
# escA= "My "first" double quotes" (This will result in error)
escA = "My 'first' double quotes"
escB = "My \"first\" double quotes"
print( escA + "; " + escB)

#3 String Index
# Index starts at 0 in python
indA = strA[0]
indB = strA[5]
print("Print indexes: " + indA + "; " + indB)

#4 Slicing of Strings

strC = "Python"

sliceA = strC[:3] #gives first 3 characters
sliceB = strC[3:] #gives last 3 characters
sliceC = strC[2:4] #gives the 3rd and 4th characters
print("Print slice indexes: " + sliceA + "; " + sliceB + "; "+ sliceC)

Python Fundamentals - Variables and Datatypes Examples

# Save the code below as a Python file and execute it to see the output.

# 1. Add a variable and assign datatype int

myInt = 5
print(myInt)

# 2. Add a variable and assign datatype float

myFloat = 5.5
print(myFloat)

 # 3. Add a variable and assign datatype boolean
myBool = True
print(myBool)

Sunday, January 21, 2018

Bigquery Data Load with Command Line



Data load using bq involves three write modes:

1. Empty (default): It writes data into an empty table; if data already exists, it throws an error.
bq query --n=1000 --destination_table=<table_name> 'SELECT * FROM [project:dataset.source_table];'

2. Replace: It replaces the current table with the newly obtained data output.
It involves loss of existing data in the destination table.
Use it wisely to perform incremental loads which involve updates and inserts.
bq query --replace --destination_table=<table_name> 'SELECT * FROM [project:dataset.source_table];'

3. Append: It appends new records to the existing table.
If the same command is executed more than once, it will create duplicate records.
It can be used for incremental loads which involve only data inserts.
bq query --append_table --destination_table=<table_name> 'SELECT * FROM [project:dataset.source_table];'

SQL 2016 New Features


  • Query Store: The Query Store feature maintains a history of query execution plans with their performance data, and quickly identifies queries that have gotten slower recently, allowing administrators or developers to force the use of an older, better plan if needed.
  • Polybase: This feature will benefit you if your regular data processing involves dealing with a lot of large text files -- they can be stored in Azure Blob Storage or Hadoop, and queried as if they were database tables. 
  • Stretch Database: The basics of Stretch Database are that some part of your tables (configurable or automated) will be moved into an Azure SQL Database in the cloud in a secure fashion. When you query those tables, the query optimizer knows which rows are on your server and which rows are in Azure, and divides the workload accordingly. 
  • JSON Support: Provides the ability to quickly move JSON data into tables and query it (see the short T-SQL sketch after this list).
  • Row Level Security: This restricts which users can view what data in a table, based on a function. SQL Server 2016 introduces this feature, which is very useful in multi-tenant environments where you may want to limit data access based on customer ID.
  • Always Encrypted: Always Encrypted uses an enhanced client library at the application layer so that data stays encrypted in transit, at rest, and while it is in use in the database.
  • In-Memory Enhancements: Optimally designed for high-speed loading of data with no locking issues or high-volume session state issues.
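
Below is a minimal T-SQL sketch for two of these features (Query Store and JSON support). The database and table names are hypothetical placeholders used only for illustration.

-- Turn on the Query Store for a database (database name is illustrative)
ALTER DATABASE SalesDb SET QUERY_STORE = ON;

-- JSON support: shape relational rows as JSON text
SELECT ProductID, Name, ListPrice
FROM dbo.Products   -- hypothetical table
FOR JSON PATH;

-- ...and parse a JSON string back into rows
SELECT *
FROM OPENJSON(N'[{"ProductID":1,"Name":"Chair","ListPrice":49.99}]')
WITH (ProductID int, Name nvarchar(50), ListPrice decimal(9,2));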