Labels

Apache Hadoop (3) ASP.NET (2) AWS S3 (2) Batch Script (3) BigQuery (21) BlobStorage (1) C# (3) Cloudera (1) Command (2) Data Model (3) Data Science (1) Django (1) Docker (1) ETL (7) Google Cloud (5) GPG (2) Hadoop (2) Hive (3) Luigi (1) MDX (21) Mongo (3) MYSQL (3) Pandas (1) Pentaho Data Integration (5) PentahoAdmin (13) Polybase (1) Postgres (1) PPS 2007 (2) Python (13) R Program (1) Redshift (3) SQL 2016 (2) SQL Error Fix (18) SQL Performance (1) SQL2012 (7) SQOOP (1) SSAS (20) SSH (1) SSIS (42) SSRS (17) T-SQL (75) Talend (3) Vagrant (1) Virtual Machine (2) WinSCP (1)

Saturday, November 11, 2017

How to Remove HTML Tags and Get the Word Count of a Content in Bigquery


REGEXP to remove html tags and get the word count of a content

SELECT content_id, COUNT(words) wordcount
FROM (
SELECT content_id, SPLIT(tt, '') words
FROM (
SELECT content_id,
REGEXP_REPLACE(content, r'(<[^>]+>|\&(nbsp;)|(amp;)|&#\d\d\d\d)', '') tt,
FROM [project:dataset.table] ))
GROUP BY content_id