About This Blog

Welcome to a blog by just another Technology Enthusiast.

Since 2007, I have been writing this blog to organize my learning on DB/DWH/ETL/BI/Big Data (Hadoop/Spark)/Analytics concepts, architecture, development & performance tuning. I post my learning in my own way (notes, examples, workshops, etc.), but I try to mention the sources in the references so that visitors can explore further.

This blog does not necessarily cover topics from scratch, nor does it promise to be accurate enough to implement directly; please test and verify everything on your dev/test environment before implementing.

Suggestions, feedback, queries, and doubts are all welcome.

My Experiment with Titanic

Due to the Blogger editor's limitations with code formatting, I have written this post on my Kaggle account:

My Experiment with Titanic

Here, you will learn these things:

Introduction
-Import datasets
-Analyse structure
-Survival stats & graphs

Gender Age Fare Model
-Survival stats/graphs by gender
-Survival stats/graphs by Age
-Survival stats/graphs by Fare

Decision Trees
-Train the first Dtree model
-Predict & submit

Feature Engineering
-Feature engineering on Name
-Feature engineering on Family size
-Survival stats/graphs by Title & Family size
-Train the second Dtree model
-Predict & submit

Random Forests
-Filling missing Age
-Filling missing Embarked
-Filling missing Fare
-Train the Rforest model
-Predict & submit
-Feature importance graph
-Train the Cforest model
-Predict & submit
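
As a flavor of the modelling steps above (not the notebook's exact code; a minimal R sketch assuming Kaggle's train.csv with its Survived, Sex, Age and Fare columns):

library(rpart)          # decision trees
library(randomForest)   # random forests

train <- read.csv("train.csv", stringsAsFactors = TRUE)

# Decision tree on gender, age and fare
fit_dt <- rpart(Survived ~ Sex + Age + Fare, data = train, method = "class")

# Random forest needs complete cases; drop rows with missing values for this sketch
cc <- complete.cases(train[, c("Survived", "Sex", "Age", "Fare")])
fit_rf <- randomForest(factor(Survived) ~ Sex + Age + Fare,
                       data = train[cc, ], ntree = 500, importance = TRUE)
varImpPlot(fit_rf)      # feature importance graph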

References:
Titanic: Getting Started With R
Exploring Survival on the Titanic

Text Mining in R

Install the required packages

Required <- c("tm", "SnowballC", "RColorBrewer", "ggplot2", "wordcloud", "biclust", "cluster", "igraph", "fpc")
install.packages(Required, dependencies=TRUE)

Download the Text

TEXTFILE = "data/pg100.txt"
dir.create(dirname(TEXTFILE), FALSE)
download.file("http://www.gutenberg.org/cache/epub/100/pg100.txt", destfile = TEXTFILE)

Load the Text

shakespeare = readLines(TEXTFILE)
length(shakespeare)

head(shakespeare)
tail(shakespeare)

Remove header/footer

shakespeare = shakespeare[-(1:173)]                        # drop the Project Gutenberg header
shakespeare = shakespeare[-(124195:length(shakespeare))]   # drop the trailing license text

shakespeare = paste(shakespeare, collapse = " ")
nchar(shakespeare)

Split the Text

shakespeare = strsplit(shakespeare, "<<[^>]*>>")[[1]]
length(shakespeare)

Remove dramatis personae

(dramatis.personae <- grep("Dramatis Personae", shakespeare, ignore.case = TRUE))
length(shakespeare)
shakespeare = shakespeare[-dramatis.personae]
length(shakespeare)

Convert the docs into Corpus

library(tm)
 
docs <- VectorSource(shakespeare)
docs <- Corpus(docs)
summary(docs)

Clean the text

docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))

library(SnowballC)

docs <- tm_map(docs, stemDocument)
docs <- tm_map(docs, stripWhitespace)
inspect(docs[8])

Prepare DTM & TDM

dtm <- DocumentTermMatrix(docs)   
tdm <- TermDocumentMatrix(docs)
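
The DTM has one row per document and one column per term; the TDM is its transpose. A quick shape check (a sketch using the objects above):

dim(dtm)                # documents x terms
dim(tdm)                # terms x documents
inspect(dtm[1:3, 1:5])  # peek at a small corner of the matrix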

Frequent terms

findFreqTerms(tdm, 2000)

Find associations

findAssocs(tdm, "love", 0.8)

findAssocs(dtm, c("thou" , "thee"), corlimit=0.98)

List top 20 frequent terms

freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
head(freq, 20)

wf <- data.frame(word=names(freq), freq=freq)   
head(wf)

Plot Word Frequencies

library(ggplot2)   
wf <- data.frame(word=names(freq), freq=freq)   
p <- ggplot(subset(wf, freq>2000), aes(word, freq))    
p <- p + geom_bar(stat="identity")   
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))   
p

Word Clouds

library(wordcloud)
set.seed(123)   
wordcloud(names(freq), freq, min.freq=20, scale=c(5, .1), colors=brewer.pal(6, "Dark2"))

Hierarchical Clustering

dtmss <- removeSparseTerms(dtm, 0.1)

library(cluster)   
d <- dist(t(dtmss), method="euclidean")
fit <- hclust(d=d, method="ward.D")
fit   
plot(fit, hang=-1)  

plot.new()
plot(fit, hang=-1)
groups <- cutree(fit, k=5)   # "k=" defines the number of clusters you are using   
rect.hclust(fit, k=5, border="red") # draw dendrogram with red borders around the 5 clusters

K-means Clustering

library(fpc)   
d <- dist(t(dtmss), method="euclidean")
kfit <- kmeans(d, 2)   
clusplot(as.matrix(d), kfit$cluster, color=T, shade=T, labels=2, lines=0)

Plotting a heat-map

library(ggplot2)
library(reshape2)
tdm.common = removeSparseTerms(tdm, 0.1)
tdm.dense <- as.matrix(tdm.common)

# melt the matrix into long format (Terms, Docs, count) so ggplot can map it
tdm.melted <- melt(tdm.dense, value.name = "count")

ggplot(tdm.melted, aes(x = Docs, y = Terms, fill = log10(count))) +
    geom_tile(colour = "white") +
    scale_fill_gradient(high="#FF0000", low="#FFFFFF") +
    ylab("") +
    theme(panel.background = element_blank()) +
    theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())


Hands-on with Spark in Scala

Loading in an external DataFrame

// To create a Spark DataFrame we load an external DataFrame, mtcars 
scala> import sys.process._
import java.net.URL
import java.io.File

scala> def fileDownloader(url: String, filename: String) = {
    new URL(url) #> new File(filename) !!
}

scala> fileDownloader("https://ibm.box.com/shared/static/f1dhhjnzjwxmy2c1ys2whvrgz05d1pui.csv", "/resources/mtcars.csv")

 

Initialize SQLContext

// To work with dataframes we need a SQLContext  
scala> import au.com.bytecode.opencsv.CSVParser
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

 

Creating Spark DataFrames

// Define the schema using a case class.
scala> case class Cars(car: String, mpg: String, cyl: String, disp: String, hp: String, drat: String, wt: String, qsec: String, vs: String, am: String, gear: String, carb: String)
 
// Create an RDD of Cars objects and convert it to a DataFrame.
scala> val mtcars = sc.textFile("/resources/mtcars.csv").map(_.split(",")).map(p => Cars(p(0), p(1), p(2), p(3), p(4), p(5), p(6), p(7), p(8), p(9), p(10), p(11))).toDF()
mtcars.printSchema

root
 |-- car: string (nullable = true)
 |-- mpg: string (nullable = true)
 |-- cyl: string (nullable = true)
 |-- disp: string (nullable = true)
 |-- hp: string (nullable = true)
 |-- drat: string (nullable = true)
 |-- wt: string (nullable = true)
 |-- qsec: string (nullable = true)
 |-- vs: string (nullable = true)
 |-- am: string (nullable = true)
 |-- gear: string (nullable = true)
 |-- carb: string (nullable = true)

 

Display the content of the DataFrame

scala> mtcars.show(10)

+-----------------+----+---+-----+---+----+-----+-----+---+---+----+----+
|              car| mpg|cyl| disp| hp|drat|   wt| qsec| vs| am|gear|carb|
+-----------------+----+---+-----+---+----+-----+-----+---+---+----+----+
|              car| mpg|cyl| disp| hp|drat|   wt| qsec| vs| am|gear|carb|
|        Mazda RX4|  21|  6|  160|110| 3.9| 2.62|16.46|  0|  1|   4|   4|
|    Mazda RX4 Wag|  21|  6|  160|110| 3.9|2.875|17.02|  0|  1|   4|   4|
|       Datsun 710|22.8|  4|  108| 93|3.85| 2.32|18.61|  1|  1|   4|   1|
|   Hornet 4 Drive|21.4|  6|  258|110|3.08|3.215|19.44|  1|  0|   3|   1|
|Hornet Sportabout|18.7|  8|  360|175|3.15| 3.44|17.02|  0|  0|   3|   2|
|          Valiant|18.1|  6|  225|105|2.76| 3.46|20.22|  1|  0|   3|   1|
|       Duster 360|14.3|  8|  360|245|3.21| 3.57|15.84|  0|  0|   3|   4|
|        Merc 240D|24.4|  4|146.7| 62|3.69| 3.19|   20|  1|  0|   4|   2|
|         Merc 230|22.8|  4|140.8| 95|3.92| 3.15| 22.9|  1|  0|   4|   2|
+-----------------+----+---+-----+---+----+-----+-----+---+---+----+----+
only showing top 10 rows

Selecting columns

scala> mtcars.select("car", "mpg", "wt").show(10)

+-----------------+----+-----+
|              car| mpg|   wt|
+-----------------+----+-----+
|              car| mpg|   wt|
|        Mazda RX4|  21| 2.62|
|    Mazda RX4 Wag|  21|2.875|
|       Datsun 710|22.8| 2.32|
|   Hornet 4 Drive|21.4|3.215|
|Hornet Sportabout|18.7| 3.44|
|          Valiant|18.1| 3.46|
|       Duster 360|14.3| 3.57|
|        Merc 240D|24.4| 3.19|
|         Merc 230|22.8| 3.15|
+-----------------+----+-----+
only showing top 10 rows

 

Filtering Data

scala> mtcars.filter(mtcars("mpg") < 18).show(10)

+-------------------+----+---+-----+---+----+-----+-----+---+---+----+----+
|                car| mpg|cyl| disp| hp|drat|   wt| qsec| vs| am|gear|carb|
+-------------------+----+---+-----+---+----+-----+-----+---+---+----+----+
|         Duster 360|14.3|  8|  360|245|3.21| 3.57|15.84|  0|  0|   3|   4|
|          Merc 280C|17.8|  6|167.6|123|3.92| 3.44| 18.9|  1|  0|   4|   4|
|         Merc 450SE|16.4|  8|275.8|180|3.07| 4.07| 17.4|  0|  0|   3|   3|
|         Merc 450SL|17.3|  8|275.8|180|3.07| 3.73| 17.6|  0|  0|   3|   3|
|        Merc 450SLC|15.2|  8|275.8|180|3.07| 3.78|   18|  0|  0|   3|   3|
| Cadillac Fleetwood|10.4|  8|  472|205|2.93| 5.25|17.98|  0|  0|   3|   4|
|Lincoln Continental|10.4|  8|  460|215|   3|5.424|17.82|  0|  0|   3|   4|
|  Chrysler Imperial|14.7|  8|  440|230|3.23|5.345|17.42|  0|  0|   3|   4|
|   Dodge Challenger|15.5|  8|  318|150|2.76| 3.52|16.87|  0|  0|   3|   2|
|        AMC Javelin|15.2|  8|  304|150|3.15|3.435| 17.3|  0|  0|   3|   2|
+-------------------+----+---+-----+---+----+-----+-----+---+---+----+----+
only showing top 10 rows

 

Operating on Columns

scala> mtcars.withColumn("wtTon", mtcars("wt") * 0.45).select("car", "mpg", "disp", "wtTon").show(10) 
+-----------------+----+-----+-------+
|              car| mpg| disp|  wtTon|
+-----------------+----+-----+-------+
|              car| mpg| disp|   null|
|        Mazda RX4|  21|  160|  1.179|
|    Mazda RX4 Wag|  21|  160|1.29375|
|       Datsun 710|22.8|  108|  1.044|
|   Hornet 4 Drive|21.4|  258|1.44675|
|Hornet Sportabout|18.7|  360|  1.548|
|          Valiant|18.1|  225|  1.557|
|       Duster 360|14.3|  360| 1.6065|
|        Merc 240D|24.4|146.7| 1.4355|
|         Merc 230|22.8|140.8| 1.4175|
+-----------------+----+-----+-------+
only showing top 10 rows

 

Grouping, Aggregation

scala> import org.apache.spark.sql.functions._
mtcars.groupBy("cyl").agg(avg("wt")).show(5) 
+---+-----------------+
|cyl|          avg(wt)|
+---+-----------------+
|  4|2.285727272727273|
|  6|3.117142857142857|
|  8|3.999214285714286|
|cyl|             null|
+---+-----------------+

// We can also sort the output from the aggregation to get the most common cars 
scala> mtcars.groupBy("cyl").agg(count("wt"), sum("wt")).sort($"count(wt)".desc).show(5)
+---+---------+------------------+
|cyl|count(wt)|           sum(wt)|
+---+---------+------------------+
|  8|       14|55.989000000000004|
|  4|       11|            25.143|
|  6|        7|             21.82|
|cyl|        1|              null|
+---+---------+------------------+
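
Note the stray "cyl" row with a null aggregate above: the CSV header line was loaded as data (it also appears as the first data row in the earlier outputs). A hedged Spark 1.x sketch of dropping it and casting a string column explicitly before aggregating ("!==" is the Spark 1.x column inequality operator):

scala> val mtcarsData = mtcars.filter($"car" !== "car")   // drop the header row
scala> mtcarsData.groupBy("cyl").agg(avg($"wt".cast("double"))).show()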

 

Running SQL Queries from Spark DataFrames

// Register this DataFrame as a table.
scala> mtcars.registerTempTable("cars")
 
// SQL statements can be run by using the sql methods provided by sqlContext.
scala> val highgearcars = sqlContext.sql("SELECT car, mpg, gear FROM cars WHERE cyl >= 4 AND cyl <= 9")
highgearcars.show(6)

+-----------------+----+----+
|              car| mpg|gear|
+-----------------+----+----+
|        Mazda RX4|  21|   4|
|    Mazda RX4 Wag|  21|   4|
|       Datsun 710|22.8|   4|
|   Hornet 4 Drive|21.4|   3|
|Hornet Sportabout|18.7|   3|
|          Valiant|18.1|   3|
+-----------------+----+----+
only showing top 6 rows

 

References

Spark Quick Start
IBM Data Scientist Workbench

Evolution Of Big Data (from Hadoop to Spark)

Big Data
-Data is the lifeblood of any organization
-Data is growing exponentially
-Getting business value out of data
-Volume/Velocity/Variety

Drivers for Big Data
-Limitations of traditional RDBMSs
-Proprietary MPP products (Teradata/Netezza/Greenplum) are expensive
-Scalability/High Availability/Fault Tolerance/Low Cost

Hadoop
-Implementation of Big Data
-Separated data processing from distributed computing
-Cluster Manager (YARN)
-Distributed Computing (MapReduce)
-Distributed File System (HDFS)

Hadoop 1.0
-Client/Master/Slave
-Client(load data, submit jobs, retrieve data)
-Master(NameNode, Secondary NameNode & JobTracker)
-Slave(DataNode, TaskTracker)

Hadoop 2.0
-Master(Active NameNode, Standby NameNode, Secondary NameNode, Resource Manager)
-Slave(DataNode, NodeManager, Application Master)

Hadoop 1.0 Vs 2.0
-HDFS Federation – horizontal scalability of NameNode
-NameNode High Availability – NameNode is no longer a Single Point of Failure
-YARN – ability to process Terabytes and Petabytes of data available in HDFS using Non-MapReduce applications such as MPI, GIRAPH
-Resource Manager – splits up the two major functionalities of the overburdened JobTracker (resource management and job scheduling/monitoring) into two separate daemons: a global ResourceManager and a per-application ApplicationMaster

Spark
-Fast in-memory(RDDs) processing
-Support for sophisticated analytics
-Real-time stream processing
-Written in Scala

Components
-Spark Core (a driver program, a cluster manager, workers, executors, and tasks)
-Spark SQL (SQL interface for Spark)
-Spark Streaming (for real-time processing)
-MLlib (for large-scale ML operations)
-GraphX (for large-scale graph processing)

MapReduce Vs Spark
-100 times faster than MapReduce
-Stores data in memory
-Real time processing
-Iterative ML algorithms
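
For a flavor of the RDD API referred to above, a minimal word-count sketch for the Spark shell ("input.txt" is a placeholder file name):

scala> val lines = sc.textFile("input.txt")
scala> val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
scala> counts.take(10).foreach(println)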

Apache Spark with Hadoop
-Alternative to MapReduce
-Compatible with HDFS, HBASE & YARN
-Support for multiple languages
-Stable APIs & Library support

References:
Hadoop Architecture
Hadoop 2.0 Architecture
Hadoop 1.0 Vs 2.0
Apache Spark Vs Hadoop MapReduce
Apache Spark with Hadoop

Internet of Things (IoT) Primer

What is it?
The Internet of Things (IoT) is the ability of things that contain embedded technologies to sense, communicate, interact, and collaborate with other things, thus creating a network of physical objects. In recent years this concept has gained enormous momentum, and it is now one of the most talked-about topics in technology. At this rapid rate of growth, it is projected that there will be approximately 26 billion connected devices by 2020.

IoT Building Blocks

Sensors & Sensor technology – Location, Weather, Movement, Temperature, Environment
IoT Gateways – to bridge the internal network of sensor nodes with the external Internet or World Wide Web
Cloud/server infrastructure & Big Data – to store, process & analyse transmitted data
End-user Mobile apps – to help consumers control & monitor their devices
IPv6 – to provide enough IP addresses for devices to connect over the Internet

Wireless Technologies

RFID - identify and track tags attached to objects (using EM fields)
NFC - a set of communication protocols that enable two electronic devices to communicate
WiFi - allows electronic devices to connect to a WLAN
BLE - wireless technology standard for PANs
XBee - a family of form factor compatible radio modules
Zigbee - a mesh network specification for low-power wireless networks that cover a large area

Why is it required?

For Consumer
-Ability to remotely monitor and control devices for peace of mind
-Improve energy efficiency of household devices
-Extends lifecycle for products, through the ability to update
-Ability to integrate multiple connected products for improved customer experience
-Ability for products to manage themselves

For Enterprise
-Gain insight into customer usage and device performance to improve future products
-Open new opportunity to monetize value added service around product usage
-Better insight into supply chain management
-Ability to update products in the field to enhance capabilities and extend lifecycle



How does it work?

1. Sensors connect to the cloud infrastructure via IoT gateways (RESTful connections)
2. The cloud/server stores & processes the detailed records from the sensors
3. The consumer connects to the cloud through a mobile app (after applicable authentication) to check status & send requests
4. The cloud sends requests to the sensors via the IoT gateways

Latest Trends by Gartner

–Spending on IoT services will reach $235 billion in 2016, up 22% from 2015.
–In terms of hardware spending, spending on consumer applications will total $546 billion, and the use of connected things in the enterprise will drive $868 billion in 2016.
– 6.4 billion connected things will be in use worldwide in 2016, up 30% from 2015, reaching 20.8 billion by 2020.
–In 2016, 5.5 million new things will get connected every day.
–More than half of major new business processes and systems will incorporate some element of the Internet of Things (IoT) by 2020.
–Through 2018, 75% of IoT projects will take up to twice as long as planned.
–By 2020, a black market exceeding $5 billion will exist to sell fake sensor and video data for enabling criminal activity and protecting personal privacy.
–By 2020, addressing compromises in IoT security will have increased security costs to 20% of annual security budgets, from less than one percent in 2015.


References
How the Internet of Things Works
How IoT works – an overview of the technology architecture
How the AWS IoT Platform Works
Internet Of Things (IoT) Predictions From Forrester, Machina Research, WEF, Gartner, IDC

Digital Enterprise Overview

What is this?
A digital enterprise is an organization that uses technology as a competitive advantage in its internal and external operations.

Why is it required?
-Collision of an unprecedented wave of innovation
-Digital & business have become synonymous
-Requirement of new framework of understanding & action

Disruptions/Drivers
-Social media
-Mobility
-Analytics (Big Data)
-Cloud computing

How to go about it?
Focus Areas
-Strategy & Innovation
--Vision on where the value proposition is shifting
--New possible ways of doing business
-Customer Decision Journey
--Analyse & understand customer behaviours and influences
--Identify high-value microsegments
-Process Automation
--Reinvent & simplify business processes
--Employ agile development
-Organization
--Structure to follow Strategy
--Integrate COEs with main business
--Measure to manage
-Technology
--Drive innovation, automation & personalization
--Two-speed IT model
-Data & Analytics
--Decide which data to use
--Focus analytics on delivery with insight
--Build right capabilities & processes

Digital Business Model
-Digital products and services
--More personalized
-Seamless multichannel (digital and physical) experience
-Understand customer behavior (via Big Data & Analytics)
-Automating operations and digitizing business processes

Elements Of Digital Enterprise Architecture
-Two-speed architecture
-Instant cross-channel deployment of functionality
-Zero downtime
-Real-time data analytics
-Easy process configuration
-Product factory
-Automated scaling of IT platforms
-Secure architecture

Key Players
-Amazon
-Google
-Netflix

Key Solution Providers
-TCS
-Cognizant
-Genpact
-Tech Mahindra

References:
The Digital Enterprise Vol 5
Six building blocks for creating a high-performing digital enterprise
Two Speed IT Architecture For The Digital Enterprise

Install Cloudera Hadoop Cluster on Google Cloud

Create & configure nodes on Google cloud
1. Create nodes hc2nn, hc2r1m1, hc2r1m2, hc2r1m3 on Google Cloud
2. Choose Ubuntu 14.04 (master R:13GB/S:50GB, slaves R:3.75GB/S:30GB)

Enable remote desktop connection
3. Install VNC server on hc2nn
    $ sudo apt-get update
    $ sudo apt-get install tightvncserver
4. Install Desktop env Xfce
    $ sudo apt-get install xfce4 xfce4-goodies
5. Configure the VNC server
    $ vncserver
6. Check 5901 port
    $ nc localhost 5901
    RFB 003.008
7. Install a VNC client (RealVNC Viewer)
8. Open the firewall on Google Cloud
    Compute > Compute Engine > Network
    Add new firewall rule
    Name: vnc-server
    Source filter: IP ranges
    Source IP Ranges: 0.0.0.0/0
    Allowed ports: tcp:5901
    Add tag vnc-server to hc2nn
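    The same rule can also be sketched with the gcloud CLI (an assumption: the Cloud SDK is installed and authenticated)
    $ gcloud compute firewall-rules create vnc-server \
        --allow tcp:5901 --source-ranges 0.0.0.0/0 --target-tags vnc-server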
9. Test VNC server
    $ nc 104.197.91.140 5901
    RFB 003.008
Connect from client machine
10. Install & configure Firefox on hc2nn
    $ wget http://ftp.mozilla.org/pub/mozilla.org/firefox/releases/41.0/linux-x86_64/en-US/firefox-41.0.tar.bz2
    $ tar xvjf firefox-41.0.tar.bz2
    $ sudo mv firefox /usr/local/
    $ sudo ln -s /usr/local/firefox/firefox /usr/bin/firefox

Enable SSH password-less connection b/w nodes
11. Login via root on every node
12. Generate SSH key on hc2nn & copy over other nodes
    $ ssh-keygen
    $ ssh-copy-id root@hc2r1m1
    $ .....
13. Test connectivity
    $ ssh root@hc2r1m1
    $ .....

Download/Install Cloudera Manager 5 on hc2nn
14. CM Installer
    $ wget http://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin
    $ chmod u+x cloudera-manager-installer.bin
    $ sudo ./cloudera-manager-installer.bin
Keep proceeding & accepting till Finish
15. Uninstall Cloudera Manager (just in case)
    $ sudo /usr/share/cmf/uninstall-cloudera-manager.sh

Configure hadoop cluster using Cloudera Manager Admin Console
16. Login as Admin/admin & configure hadoop cluster
    http://ServerHost:7180
17. Skip "Enable Single User Mode" & install trial edition
18. Install & set-up the cluster selecting services
Hadoop (Common, HDFS, MapReduce, YARN), HBase, ZooKeeper, Oozie, Hive, Hue, Flume, Impala
19. Proceed till finish & verify first-run.
    

References:
https://medium.com/google-cloud/linux-gui-on-the-google-cloud-platform-800719ab27c5
http://www.bogotobogo.com/Hadoop/BigData_hadoop_CDH5_Install.php

Data Analysis with R

Structure of data
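
The emp data frame explored below is assumed to be already loaded; a hedged one-liner to load it from a CSV (the file name is an assumption, matching the Basics of R post):

> emp <- read.csv("emp.csv", stringsAsFactors = FALSE)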

> str(emp)
'data.frame': 20 obs. of  6 variables:
 $ id       : int  101 102 103 104 105 106 107 108 109 110 ...
 $ name     : chr  "Ankit" "Amit" "Ashu" "Nitin" ...
 $ salary   : int  1100 1900 2700 2500 1400 3100 2000 3500 2500 1500 ...
 $ status   : chr  "Active" "Inactive" "Active" "Active" ...
 $ dep      : int  10 20 20 30 10 10 20 20 30 10 ...
 $ exp      : int  3 6 8 7 3 9 5 9 8 4 ...

Display data

> emp
    id    name salary   status dep exp
1  101   Ankit   1100   Active  10   3
2  102    Amit   1900 Inactive  20   6
3  103    Ashu   2700   Active  20   8
4  104   Nitin   2500   Active  30   7
5  105  Mukund   1400   Active  10   3
6  106   Rathi   3100   Active  10   9
7  107   Dixit   2000   Active  20   5
8  108   Gupta   3500   Active  20   9
9  109   Verma   2500   Active  30   8
10 110  Sharma   1500   Active  10   4
11 111    Nain   1000   Active  10   3
12 112   Shail   1200   Active  20   3
13 113   Yadav   2900   Active  20   9
14 114  Pankaj   2400   Active  30   7
15 115    Seth   1600 Inactive  10   5
16 116  Rajesh   1700   Active  10   5
17 117 Dwivedi   1900   Active  20   6
18 118  Pandey   3100 Inactive  20   9
19 119   Khari   2200   Active  30   7
20 120     Dev   1100   Active  10   2

Exploring numeric variables

> summary(emp$exp)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2.00    3.75    6.00    5.90    8.00    9.00 

> summary(emp[c("salary", "exp")])
     salary          exp      
 Min.   :1000   Min.   :2.00  
 1st Qu.:1475   1st Qu.:3.75  
 Median :1950   Median :6.00  
 Mean   :2065   Mean   :5.90  
 3rd Qu.:2550   3rd Qu.:8.00  
 Max.   :3500   Max.   :9.00  

Mean and Median

> mean(emp$salary)
[1] 2065
> median(emp$salary)
[1] 1950
> mean(emp$exp)
[1] 5.9
> median(emp$exp)
[1] 6

Range & Diff

> range(emp$salary)
[1] 1000 3500
> diff(range(emp$salary))
[1] 2500

Interquartile range (IQR)

> IQR(emp$salary)
[1] 1075

> quantile(emp$salary)
  0%  25%  50%  75% 100% 
1000 1475 1950 2550 3500 

Boxplots

> boxplot(emp$salary)

[boxplot of emp$salary]

> boxplot(emp$exp)

[boxplot of emp$exp]

Histograms

> hist(emp$salary)

[histogram of emp$salary]

> hist(emp$exp)

[histogram of emp$exp]

Variance and Standard Deviation

> var(emp$salary)
[1] 564500
> sd(emp$salary)
[1] 751.3322
> var(emp$exp)
[1] 5.568421
> sd(emp$exp)
[1] 2.35975

Categorical

> table(emp$dep)

10 20 30 
 8  8  4 

> table(emp$exp)

2 3 4 5 6 7 8 9 
1 4 1 3 2 3 2 4 

Scatterplots

> plot(x = emp$exp, y = emp$salary)

[scatterplot of exp vs salary]

Two-way Cross-tabulations

> library(gmodels)   # CrossTable() is provided by the gmodels package
> CrossTable(x = emp$dep, y = emp$status)

Basics of R

Vectors
-stores an ordered set of values called elements
-integer (numbers without decimals)
-numeric (numbers with decimals)
-character (text data)
-logical (TRUE or FALSE values)

> name <- c("Ankit", "Amit", "Ashu")
> salary<- c(1100, 1000, 1500)
> status <- c(FALSE, FALSE, TRUE)
> name
[1] "Ankit" "Amit"  "Ashu"
> salary
[1] 1100 1000 1500
> status
[1] FALSE FALSE  TRUE


Factors
-categories of values are known as nominal

> gender <- factor(c("MALE", "FEMALE", "MALE"))
> gender
[1] MALE   FEMALE MALE
Levels: FEMALE MALE

Lists
-elements can be of different types

> emp1 <- list(name = name[1],
+              salary= salary[1],
+              status = status[1],
+              gender = gender[1])
> emp1
$name
[1] "Ankit"

$salary
[1] 1100

$status
[1] FALSE

$gender
[1] MALE
Levels: FEMALE MALE

Data frames
-analogous to a spreadsheet or a database table

> emp<- data.frame(name, salary, status, gender,
+                  stringsAsFactors = FALSE)
> emp
   name salary status gender
1 Ankit   1100  FALSE   MALE
2  Amit   1000  FALSE FEMALE
3  Ashu   1500   TRUE   MALE

Matrices and arrays
-matrices can contain only a single type of data (typically numeric)

> m <- matrix(c('a', 'b', 'c', 'd', 'e', 'f'), nrow = 2)
> m
     [,1] [,2] [,3]
[1,] "a"  "c"  "e"
[2,] "b"  "d"  "f"
> m <- matrix(c('a', 'b', 'c', 'd', 'e', 'f'), ncol = 2)
> m
     [,1] [,2]
[1,] "a"  "d"
[2,] "b"  "e"
[3,] "c"  "f"

Saving and loading R data structures

> save(x, y, z, file = "mydata.RData")
> load("mydata.RData")

Importing data from CSV files

> emp<- read.csv("emp.csv", stringsAsFactors = FALSE)

Saving data to CSV files

> write.csv(emp, file = "emp.csv")

Importing data from SQL databases

> install.packages("RODBC")
> library(RODBC)
> mydb <- odbcConnect("my_dsn", uid = "my_username", pwd = "my_password")
> emp_query <- "select * from emp where dep = 20"
> emp_data <- sqlQuery(channel = mydb, query = emp_query,
+                      stringsAsFactors = FALSE)
> odbcClose(mydb)

References:
Cookbook for R

Data Science Specialization


Coursera now offers a Data Science Specialization taught by Johns Hopkins University. If you would like to earn the Specialization Certificate, each course will cost you $49; otherwise, you can take the courses for free without earning the certificate.

The Specialization consists of the following courses. You must complete all of them for the certification; if you are not interested in the certificate, you can take any or all of the courses independently.

The Data Scientist’s Toolbox
An introduction to the main tools and ideas in the data scientist's toolbox. The course gives an overview of the data, questions, and tools that data analysts and data scientists work with.

R Programming
How to program in R and how to use R for effective data analysis. 

Getting and Cleaning Data
The basic ways that data can be obtained from the web, from APIs, from databases and from colleagues in various formats. 

Exploratory Data Analysis
The essential exploratory techniques for summarizing data.

Reproducible Research
The concepts and tools behind reporting modern data analyses in a reproducible manner.

Statistical Inference
The process of drawing conclusions about populations or scientific truths from data. 

Regression Models
Relating an outcome to a set of predictors of interest using linear assumptions.

Practical Machine Learning
The basic components of building and applying prediction functions with an emphasis on practical applications. 

Developing Data Products
Data products automate complex analysis tasks or use technology to expand the utility of a data informed model, algorithm or inference.

Workshop: Capstone Project

References:
Data Science Specialization

Hierarchical Queries Workshop

Hierarchical Queries Overview
-can select rows in a hierarchical order using the hierarchical query clause
-evaluation process
--JOINs
--CONNECT BY
--WHERE
-hierarchy formation
--selects the root row(s) of the hierarchy (START WITH)
--selects the child rows of each root row and so on
--eliminates child rows based on WHERE clause (individually)
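
To make the evaluation order concrete, a minimal skeleton (table and column names are placeholders):

SELECT ...
FROM   some_table
WHERE  ...                        -- applied last, to individual rows
START WITH parent_id IS NULL      -- picks the root row(s)
CONNECT BY PRIOR id = parent_id;  -- walks from each row to its children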

CONNECT BY: specifies the relationship between parent rows and child rows of the hierarchy
PRIOR: to refer to the parent row
SELECT employee_id, last_name, manager_id
FROM employees
CONNECT BY PRIOR employee_id = manager_id;

EMPLOYEE_ID LAST_NAME MANAGER_ID
-----------------------------------------
120         Weiss     100
125         Nayer     120
182         Sullivan  120
183         Geoni     120
181         Fleaur    120

LEVEL: shows levels in hierarchy
SELECT employee_id, last_name, manager_id, LEVEL
FROM employees
CONNECT BY PRIOR employee_id = manager_id;

EMPLOYEE_ID LAST_NAME MANAGER_ID LEVEL
-----------------------------------------------------
120         Weiss     100        1
125         Nayer     120        2
182         Sullivan  120        2
183         Geoni     120        2
181         Fleaur    120        2

START WITH: to specify a root row for the hierarchy
ORDER SIBLINGS BY: to preserve ordering within the hierarchy
SELECT last_name, employee_id, manager_id, LEVEL
FROM employees
START WITH employee_id = 100
CONNECT BY PRIOR employee_id = manager_id
ORDER SIBLINGS BY last_name;

LAST_NAME EMPLOYEE_ID MANAGER_ID LEVEL
-----------------------------------------------------
King      100         -          1
Cambrault 148         100        2
Bates     172         148        3
Bloom     169         148        3
Fox       170         148        3

SYS_CONNECT_BY_PATH: to retrieve the path from root to node
SELECT last_name, employee_id, manager_id, 
   LEVEL, SYS_CONNECT_BY_PATH(last_name, '/') Path
FROM employees
START WITH employee_id = 100
CONNECT BY PRIOR employee_id = manager_id
ORDER SIBLINGS BY last_name;

LAST_NAME EMPLOYEE_ID MANAGER_ID LEVEL PATH
--------------------------------------------------------------------
King      100         -          1     /King
Cambrault 148         100        2     /King/Cambrault
Bates     172         148        3     /King/Cambrault/Bates
Bloom     169         148        3     /King/Cambrault/Bloom
Fox       170         148        3     /King/Cambrault/Fox

NOCYCLE: to return the rows despite a loop in the data
UPDATE employees 
SET manager_id = 145
WHERE employee_id = 100;

SELECT last_name "Employee", 
  LEVEL, SYS_CONNECT_BY_PATH(last_name, '/') "Path"
FROM employees
WHERE level <= 3 AND department_id = 80
START WITH last_name = 'King'
CONNECT BY PRIOR employee_id = manager_id AND LEVEL <= 4;

ERROR:
ORA-01436: CONNECT BY loop in user data

SELECT last_name "Employee", CONNECT_BY_ISCYCLE "Cycle",
  LEVEL, SYS_CONNECT_BY_PATH(last_name, '/') "Path"
FROM employees
WHERE level <= 3 AND department_id = 80
START WITH last_name = 'King'
CONNECT BY NOCYCLE PRIOR employee_id = manager_id AND LEVEL <= 4
ORDER BY "Employee", "Cycle", LEVEL, "Path";

Employee Cycle LEVEL Path
---------------------------------------
Abel      0     3     /King/Zlotkey/Abel
Ande      0     3     /King/Errazuriz/Ande
Banda     0     3     /King/Errazuriz/Banda
Bates     0     3     /King/Cambrault/Bates
Bernstein 0     3     /King/Russell/Bernstein
Bloom     0     3     /King/Cambrault/Bloom
......................................................................................
Russell   1     2     /King/Russell
......................................................................................

CONNECT_BY_ISLEAF: identifies leaf rows (used here to turn column values into a comma-delimited list)
SELECT LTRIM(SYS_CONNECT_BY_PATH (employee_id,','),',') employee_ids
FROM
   (SELECT ROWNUM r, employee_id 
    FROM employees
    WHERE rownum <=10)
WHERE CONNECT_BY_ISLEAF = 1
START WITH r = 1
CONNECT BY r = PRIOR r + 1
ORDER BY employee_id;  

EMPLOYEE_IDS
--------------
100,101,102,103,104,105,106,107,108,109

CONNECT_BY_ROOT: returns the root row's value for each node (here, the levels and paths between root and node)
SELECT last_name "Employee", CONNECT_BY_ROOT last_name "Manager",
   LEVEL-1 "Pathlen", SYS_CONNECT_BY_PATH(last_name, '/') "Path"
FROM employees
WHERE LEVEL > 1 and department_id = 60
CONNECT BY PRIOR employee_id = manager_id
ORDER BY "Employee", "Manager", "Pathlen", "Path";

Employee Manager Pathlen Path
--------------------------------------------
Austin   De Haan 2       /De Haan/Hunold/Austin
Austin   Hunold  1       /Hunold/Austin
Austin   King    3       /King/De Haan/Hunold/Austin
Ernst    De Haan 2       /De Haan/Hunold/Ernst
Ernst    Hunold  1       /Hunold/Ernst
Ernst    King    3       /King/De Haan/Hunold/Ernst

GROUP BY with CONNECT_BY_ROOT
SELECT name "Name", SUM(salary) "Total_Salary" 
FROM ( SELECT CONNECT_BY_ROOT last_name as name, Salary
       FROM employees
       WHERE department_id = 110
       CONNECT BY PRIOR employee_id = manager_id )
GROUP BY name
ORDER BY name, "Total_Salary";

Name    Total_Salary
--------------------
Gietz   8300
Higgins 20300
King    20300
Kochhar 20300

Reference:
Hierarchical Queries 

Oracle Regular Expression Workshop

Regular Expression Overview
-finding, counting, positioning, matching and replacing complex patterns
-using meta-characters and literals
-centralized pattern-matching logic
-enforcing constraints at server-side

REGEXP_LIKE: to match the given pattern
SELECT employee_id, first_name, last_name
FROM employees
WHERE REGEXP_LIKE(first_name, '^Ste(v|ph)en$');


EMPLOYEE_ID FIRST_NAME LAST_NAME
------------------------------------------
100         Steven     King
128         Steven     Markle
138         Stephen    Stiles

REGEXP_COUNT: count of occurrences of a given pattern
SELECT REGEXP_COUNT('Albert Einstein', 'e', 3, 'i') occurrence 
FROM dual;

OCCURRENCE
----------
3

SELECT REGEXP_COUNT('Albert Einstein', 'e', 3, 'c') occurrence 
FROM dual;

OCCURRENCE
----------
2

REGEXP_INSTR: starting or ending position of a given pattern
SELECT  lower(email) email,
     REGEXP_INSTR(email, '\w+@\w+(\.\w+)+') valid 
FROM employees;

EMAIL   VALID
----------------
sking      0
nkochhar   0
ldehaan    0
ahunold    0
bernst     0
daustin    0

SELECT  lower(email)||'@gmail.com' email,
     REGEXP_INSTR(email||'@gmail.com', '\w+@\w+(\.\w+)+') valid 
FROM employees;

EMAIL               VALID
-----------------------------
sking@gmail.com       1
nkochhar@gmail.com    1
ldehaan@gmail.com     1
ahunold@gmail.com     1
bernst@gmail.com      1
daustin@gmail.com     1

REGEXP_REPLACE: to replace a given pattern
SELECT country_name, 
   REGEXP_REPLACE(country_name, '(.)', '\1 ') new_name
FROM countries;

COUNTRY_NAME   NEW_NAME
------------------------------
Australia      A u s t r a l i a
Belgium        B e l g i u m
Brazil         B r a z i l
Switzerland    S w i t z e r l a n d
China          C h i n a
Germany        G e r m a n y
Egypt          E g y p t

REGEXP_SUBSTR: extracts the substring that matches the given pattern
SELECT REGEXP_SUBSTR('abcd', 'a b c d', 1, 1, 'x') cutx
FROM dual;

CUTX
------
abcd

SELECT REGEXP_SUBSTR('ab'||CHR(10)||'ac', '^a.', 1, 2, 'm') cutm
FROM dual;

CUTM
-----
ac

Using as CHECK constraint
DROP TABLE contacts;

CREATE TABLE contacts (
  l_name    VARCHAR2(30),
  p_number  VARCHAR2(30)
  CONSTRAINT c_contacts_pnf
  CHECK (REGEXP_LIKE (p_number, '^\(\d{3}\) \d{3}-\d{4}$'))
);

--insert successful
INSERT INTO contacts (p_number) VALUES('(650) 555-0100');

--generates CHECK constraint errors
INSERT INTO contacts (p_number) VALUES('650 555-0100');
INSERT INTO contacts (p_number) VALUES('(650)555-0100');

---------------------------------------------------
ORA-02290: check constraint (ONB.C_CONTACTS_PNF) violated

Reference:
Using Regular Expressions in Database Applications 

Oracle Database 12c: Multitenant Architecture

What is Multitenant Architecture? 
-Multitenant: multiple tenants sharing the same resources for mutual benefit
-multiple Oracle databases share a single instance's resources while serving different purposes on the same server
-CDB: Oracle Database built on Multitenant foundation 
-PDB: container(tenant) residing inside CDB 
-Root: stores Oracle-supplied metadata and common users 
-Seed PDB: a system-supplied template that the CDB can use to create new PDBs 
-User-created PDBs: contains the data and code required for a specific set of features supporting specific applications 
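
For flavor, a hedged sketch of creating a user PDB from the seed (names and the file-name mapping are placeholders):

CREATE PLUGGABLE DATABASE pdb1
  ADMIN USER pdb_admin IDENTIFIED BY password
  FILE_NAME_CONVERT = ('/pdbseed/', '/pdb1/');

ALTER PLUGGABLE DATABASE pdb1 OPEN;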
  
Why Multitenant Architecture?
-Large enterprises may use hundreds or thousands of databases 
-These databases run on different platforms on multiple physical servers 
-Because of improvements in hardware technology, servers are able to handle heavier workloads 
-Approach 1: challenges in placing multiple databases on one server (each needs its own background processes, system and process memory & metadata)
-Approach 2: difficulties with virtual entities (in managing, securing, and transporting them)

Benefits 

-Cost reduction (consolidation, resource sharing) 
-Easier Database maintenance  (rapid movement of data and code, separation of data and code, secure & separate admin duties, fewer patches and upgrades,  easier testing)
-Efficient Performance tuning (easier to collect performance metrics, single SGA) 
-Transparency (no configuration changes required, transparent to applications)

Reference:

Introduction to the Multitenant Architecture 

Clustering Factor Workshop

/*
Scenario:
Effect of clustering factor on plan & cost
Source: Asktom
*/

--Prepare test case
DROP TABLE stage;
DROP TABLE org;
DROP TABLE disorg;

CREATE TABLE stage AS
SELECT * FROM all_objects;

CREATE TABLE org AS
SELECT x.* FROM
  (SELECT * FROM stage ORDER BY object_name
  ) x ;

CREATE TABLE disorg AS
SELECT x.* FROM
  (SELECT * FROM stage ORDER BY dbms_random.random
  ) x ;

CREATE INDEX org_idx ON org
  (object_name
  );

CREATE INDEX disorg_idx ON disorg
  (object_name
  );

--Collect stats
BEGIN
  dbms_stats.gather_table_stats ( USER, 'ORG', estimate_percent => 100, method_opt=>'for all indexed columns size 254' );
  dbms_stats.gather_table_stats ( USER, 'DISORG', estimate_percent => 100, method_opt=>'for all indexed columns size 254' );
END;
/

--Compare blocks, rows
SELECT table_name,
  blocks,
  num_rows
FROM user_tables
WHERE table_name LIKE '%ORG'
ORDER BY 1;

TABLE_NAME                         BLOCKS   NUM_ROWS
------------------------------ ---------- ----------
DISORG                                853      57886 
ORG                                   853      57886 

--Compare CF
SELECT table_name,
  index_name,
  clustering_factor
FROM user_indexes
WHERE table_name LIKE '%ORG'
ORDER BY 1;

TABLE_NAME                     INDEX_NAME                     CLUSTERING_FACTOR
------------------------------ ------------------------------ -----------------
DISORG                         DISORG_IDX                                 57819 
ORG                            ORG_IDX                                      831 

--Compare costs
SELECT /*+ index( org  org_idx) */
  COUNT(subobject_name) FROM org;

Plan hash value: 3244275012
 
----------------------------------------------------------------------------------------
| Id  | Operation                    | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |         |     1 |    17 |  1129   (1)| 00:00:14 |
|   1 |  SORT AGGREGATE              |         |     1 |    17 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID| ORG     | 57886 |   960K|  1129   (1)| 00:00:14 |
|   3 |    INDEX FULL SCAN           | ORG_IDX | 57886 |       |   297   (1)| 00:00:04 |
----------------------------------------------------------------------------------------

SELECT /*+ index( disorg  disorg_idx) */
  COUNT(subobject_name) FROM disorg;

Plan hash value: 4030431232
 
-------------------------------------------------------------------------------------------
| Id  | Operation                    | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |            |     1 |    17 | 58137   (1)| 00:11:38 |
|   1 |  SORT AGGREGATE              |            |     1 |    17 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID| DISORG     | 57886 |   960K| 58137   (1)| 00:11:38 |
|   3 |    INDEX FULL SCAN           | DISORG_IDX | 57886 |       |   297   (1)| 00:00:04 |
-------------------------------------------------------------------------------------------

--Compare Explain plans
SELECT * FROM org WHERE object_name LIKE 'F%';

Plan hash value: 3677241234
 
---------------------------------------------------------------------------------------
| Id  | Operation                   | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |         |   114 | 11172 |     4   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| ORG     |   114 | 11172 |     4   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | ORG_IDX |   114 |       |     2   (0)| 00:00:01 |
---------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("OBJECT_NAME" LIKE 'F%')
       filter("OBJECT_NAME" LIKE 'F%')

SELECT * FROM disorg WHERE object_name LIKE 'F%';

Plan hash value: 220290145
 
------------------------------------------------------------------------------------------
| Id  | Operation                   | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |            |   114 | 11172 |   117   (0)| 00:00:02 |
|   1 |  TABLE ACCESS BY INDEX ROWID| DISORG     |   114 | 11172 |   117   (0)| 00:00:02 |
|*  2 |   INDEX RANGE SCAN          | DISORG_IDX |   114 |       |     2   (0)| 00:00:01 |
------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("OBJECT_NAME" LIKE 'F%')
       filter("OBJECT_NAME" LIKE 'F%')

SELECT * FROM org WHERE object_name LIKE 'A%';

Plan hash value: 3677241234
 
---------------------------------------------------------------------------------------
| Id  | Operation                   | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |         |  1221 |   116K|    26   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| ORG     |  1221 |   116K|    26   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | ORG_IDX |  1221 |       |     8   (0)| 00:00:01 |
---------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("OBJECT_NAME" LIKE 'A%')
       filter("OBJECT_NAME" LIKE 'A%')

SELECT * FROM disorg WHERE object_name LIKE 'A%';

Plan hash value: 761413676
 
----------------------------------------------------------------------------
| Id  | Operation         | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |        |  1221 |   116K|   234   (1)| 00:00:03 |
|*  1 |  TABLE ACCESS FULL| DISORG |  1221 |   116K|   234   (1)| 00:00:03 |
----------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - filter("OBJECT_NAME" LIKE 'A%')

Bulk Update Workshop

/*
Scenario:
Need to update a few million records with the most efficient method

Source: Asktom
*/


SQL> set autotrace traceonly statistics;

--Method 1: Simple update
SQL> UPDATE emp SET NAME = lower(NAME);

3560640 rows updated.


Statistics
----------------------------------------------------------
       2402  recursive calls
    3706506  db block gets
    3623335  consistent gets
      64946  physical reads
 1183926564  redo size
        680  bytes sent via SQL*Net to client
        602  bytes received via SQL*Net from client
          3  SQL*Net roundtrips to/from client
          2  sorts (memory)
          0  sorts (disk)
    3560640  rows processed

SQL> rollback;

Rollback complete.

--Method 2: CTAS
SQL> CREATE TABLE emp1 AS
  2  SELECT * FROM emp
  3  WHERE 1 = 0;

Table created.

SQL> INSERT INTO emp1
  2  SELECT ID, lower(NAME) NAME, ROLE, dep FROM emp;

3560640 rows created.


Statistics
----------------------------------------------------------
       3846  recursive calls
     221947  db block gets
      77344  consistent gets
      11259  physical reads
  211246960  redo size
        686  bytes sent via SQL*Net to client
        633  bytes received via SQL*Net from client
          3  SQL*Net roundtrips to/from client
         26  sorts (memory)
          0  sorts (disk)
    3560640  rows processed

SQL> rollback;

Rollback complete.


--Method 3: CTAS with direct path load
SQL> INSERT /*+ append */ INTO emp1
  2  SELECT ID, lower(NAME) NAME, ROLE, dep FROM emp;

3560640 rows created.


Statistics
----------------------------------------------------------
       1120  recursive calls
      26560  db block gets
      27797  consistent gets
      20869  physical reads
     148436  redo size
        669  bytes sent via SQL*Net to client
        647  bytes received via SQL*Net from client
          3  SQL*Net roundtrips to/from client
         15  sorts (memory)
          0  sorts (disk)
    3560640  rows processed

SQL> rollback;

Rollback complete.


--Method 4: CTAS with direct path load & NOLOGGING
SQL> alter table emp1 nologging;

Table altered.

SQL> INSERT /*+ append */ INTO emp1
  2  SELECT ID, lower(NAME) NAME, ROLE, dep FROM emp;

3560640 rows created.


Statistics
----------------------------------------------------------
        202  recursive calls
      25310  db block gets
      25065  consistent gets
      20834  physical reads
      41620  redo size
        670  bytes sent via SQL*Net to client
        647  bytes received via SQL*Net from client
          3  SQL*Net roundtrips to/from client
          7  sorts (memory)
          0  sorts (disk)
    3560640  rows processed

SQL> rollback;

Rollback complete.

--Effect of index
SQL> CREATE INDEX emp1_idx ON emp1(id);

Index created.

SQL> INSERT /*+ append */ INTO emp1
  2  SELECT ID, lower(NAME) NAME, ROLE, dep FROM emp;

3560640 rows created.


Statistics
----------------------------------------------------------
       3066  recursive calls
     216085  db block gets
      40880  consistent gets
      48613  physical reads
  237328904  redo size
        671  bytes sent via SQL*Net to client
        647  bytes received via SQL*Net from client
          3  SQL*Net roundtrips to/from client
          3  sorts (memory)
          1  sorts (disk)
    3560640  rows processed

SQL> rollback;

Rollback complete.


--Make index unusable
SQL> ALTER INDEX emp1_idx UNUSABLE;

Index altered.

SQL> ALTER SESSION SET skip_unusable_indexes=TRUE;

Session altered.

SQL> INSERT /*+ append */ INTO emp1
  2  SELECT ID, lower(NAME) NAME, ROLE, dep FROM emp;

3560640 rows created.


Statistics
----------------------------------------------------------
        217  recursive calls
      25110  db block gets
      25071  consistent gets
      24141  physical reads
      41224  redo size
        672  bytes sent via SQL*Net to client
        647  bytes received via SQL*Net from client
          3  SQL*Net roundtrips to/from client
          9  sorts (memory)
          0  sorts (disk)
    3560640  rows processed

--Rebuild index
SQL> ALTER INDEX emp1_idx REBUILD NOLOGGING;

Index altered.


--Change Table & Index names 
SQL> DROP TABLE emp;

Table dropped.

SQL> ALTER TABLE emp1 RENAME TO emp;

Table altered.

SQL> ALTER INDEX emp1_idx RENAME TO emp_idx;

Index altered.

Oracle External Table Workshop

--set up default directories
CREATE OR REPLACE DIRECTORY data_upload AS 'E:\app\data\upload';

--grant read/write access
GRANT READ, WRITE ON DIRECTORY data_upload TO test;

--Place Employees.txt in 'E:\app\data\upload'
100,Steven,King,SKING,515.123.4567,17-JUN-03
101,Neena,Kochhar,NKOCHHAR,515.123.4568,21-SEP-05
102,Lex,De Haan,LDEHAAN,515.123.4569,13-JAN-01
103,Alexander,Hunold,AHUNOLD,590.423.4567,03-JAN-06
104,Bruce,Ernst,BERNST,590.423.4568,21-MAY-07


--create external table
CREATE TABLE employees_ext (
  emp_id          NUMBER(6),
  first_name      VARCHAR2(20),
  last_name     VARCHAR2(25),
  email           VARCHAR2(25),
  phone_number    VARCHAR2(20),
  hire_date       DATE
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY data_upload
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ','
    MISSING FIELD VALUES ARE NULL
    (
      emp_id          CHAR(6),
      first_name      CHAR(20),
      last_name       CHAR(25),
      email           CHAR(25),
      phone_number    CHAR(20),
      hire_date       CHAR(10) date_format DATE mask "dd-MON-rr"
    )
  )
  LOCATION ('Employees.txt')
)
PARALLEL 5
REJECT LIMIT UNLIMITED;

--query external table
SELECT *
FROM   employees_ext
ORDER BY emp_id;

--create traditional table
CREATE TABLE employees (
  emp_id          NUMBER(6),
  first_name      VARCHAR2(20),
  last_name     VARCHAR2(25),
  email           VARCHAR2(25),
  phone_number    VARCHAR2(20),
  hire_date       DATE
);

--load data from external table
INSERT INTO employees 
SELECT * FROM employees_ext;

--query employees table
SELECT *
FROM   employees
ORDER BY emp_id;

Oracle Join Methods Workshop

--to get current optimizer version
SELECT NAME ||': '||VALUE opt_ver
FROM v$parameter 
WHERE NAME = 'optimizer_features_enable';

optimizer_features_enable: 11.2.0.1

--to set the current value
ALTER SESSION SET optimizer_features_enable='10.2.0.1';

--to generate the explain plan
explain plan for
;

--to display the explain plan
SELECT * FROM TABLE(dbms_xplan.display);


--Nested Loop Joins

explain plan for
SELECT e.first_name,
  e.last_name,
  e.salary,
  d.department_name
FROM hr.employees e,
  hr.departments d
WHERE d.department_name IN ('Marketing', 'Sales')
AND e.department_id      = d.department_id;


optimizer_features_enable: 11.2.0.1
Plan hash value: 1021246405
 
--------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name              | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |                   |    19 |   722 |     4   (0)| 00:00:01 |
|   1 |  NESTED LOOPS                |                   |       |       |            |          |
|   2 |   NESTED LOOPS               |                   |    19 |   722 |     4   (0)| 00:00:01 |
|*  3 |    TABLE ACCESS FULL         | DEPARTMENTS       |     2 |    32 |     3   (0)| 00:00:01 |
|*  4 |    INDEX RANGE SCAN          | EMP_DEPARTMENT_IX |    10 |       |     0   (0)| 00:00:01 |
|   5 |   TABLE ACCESS BY INDEX ROWID| EMPLOYEES         |    10 |   220 |     1   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   3 - filter("D"."DEPARTMENT_NAME"='Marketing' OR "D"."DEPARTMENT_NAME"='Sales')
   4 - access("E"."DEPARTMENT_ID"="D"."DEPARTMENT_ID")


optimizer_features_enable: 10.2.0.1
Plan hash value: 2912831499
 
-------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name              | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                   |    19 |   722 |     4   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| EMPLOYEES         |    10 |   220 |     1   (0)| 00:00:01 |
|   2 |   NESTED LOOPS              |                   |    19 |   722 |     4   (0)| 00:00:01 |
|*  3 |    TABLE ACCESS FULL        | DEPARTMENTS       |     2 |    32 |     3   (0)| 00:00:01 |
|*  4 |    INDEX RANGE SCAN         | EMP_DEPARTMENT_IX |    10 |       |     0   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   3 - filter("D"."DEPARTMENT_NAME"='Marketing' OR "D"."DEPARTMENT_NAME"='Sales')
   4 - access("E"."DEPARTMENT_ID"="D"."DEPARTMENT_ID")


optimizer_features_enable: 9.2.0.8
Plan hash value: 2912831499
 
---------------------------------------------------------------------------------
| Id  | Operation                   | Name              | Rows  | Bytes | Cost  |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                   |    19 |   722 |     4 |
|   1 |  TABLE ACCESS BY INDEX ROWID| EMPLOYEES         |    10 |   220 |     1 |
|   2 |   NESTED LOOPS              |                   |    19 |   722 |     4 |
|*  3 |    TABLE ACCESS FULL        | DEPARTMENTS       |     2 |    32 |     2 |
|*  4 |    INDEX RANGE SCAN         | EMP_DEPARTMENT_IX |    10 |       |       |
---------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   3 - filter("D"."DEPARTMENT_NAME"='Marketing' OR 
              "D"."DEPARTMENT_NAME"='Sales')
   4 - access("E"."DEPARTMENT_ID"="D"."DEPARTMENT_ID")
 
Note
-----
   - cpu costing is off (consider enabling it)


--Hash Joins

explain plan for
SELECT o.customer_id, l.unit_price * l.quantity
  FROM oe.orders o ,oe.order_items l
 WHERE l.order_id = o.order_id;

optimizer_features_enable: 11.2.0.1
Plan hash value: 2100987247
 
--------------------------------------------------------------------------------------------
| Id  | Operation               | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |                  |   665 | 13300 |     6  (17)| 00:00:01 |
|*  1 |  HASH JOIN              |                  |   665 | 13300 |     6  (17)| 00:00:01 |
|   2 |   VIEW                  | index$_join$_001 |   105 |   840 |     3  (34)| 00:00:01 |
|*  3 |    HASH JOIN            |                  |       |       |            |          |
|   4 |     INDEX FAST FULL SCAN| ORDER_PK         |   105 |   840 |     1   (0)| 00:00:01 |
|   5 |     INDEX FAST FULL SCAN| ORD_CUSTOMER_IX  |   105 |   840 |     1   (0)| 00:00:01 |
|   6 |   TABLE ACCESS FULL     | ORDER_ITEMS      |   665 |  7980 |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - access("L"."ORDER_ID"="O"."ORDER_ID")
   3 - access(ROWID=ROWID)


optimizer_features_enable: 10.2.0.1
Plan hash value: 2318691960
 
--------------------------------------------------------------------------------------------
| Id  | Operation                    | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |             |   665 | 13300 |     6  (17)| 00:00:01 |
|   1 |  MERGE JOIN                  |             |   665 | 13300 |     6  (17)| 00:00:01 |
|   2 |   TABLE ACCESS BY INDEX ROWID| ORDERS      |   105 |   840 |     2   (0)| 00:00:01 |
|   3 |    INDEX FULL SCAN           | ORDER_PK    |   105 |       |     1   (0)| 00:00:01 |
|*  4 |   SORT JOIN                  |             |   665 |  7980 |     4  (25)| 00:00:01 |
|   5 |    TABLE ACCESS FULL         | ORDER_ITEMS |   665 |  7980 |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   4 - access("L"."ORDER_ID"="O"."ORDER_ID")
       filter("L"."ORDER_ID"="O"."ORDER_ID")
 

optimizer_features_enable: 9.2.0.8
Plan hash value: 864676608
 
------------------------------------------------------------------
| Id  | Operation          | Name        | Rows  | Bytes | Cost  |
------------------------------------------------------------------
|   0 | SELECT STATEMENT   |             |   665 | 13300 |     5 |
|*  1 |  HASH JOIN         |             |   665 | 13300 |     5 |
|   2 |   TABLE ACCESS FULL| ORDERS      |   105 |   840 |     2 |
|   3 |   TABLE ACCESS FULL| ORDER_ITEMS |   665 |  7980 |     2 |
------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - access("L"."ORDER_ID"="O"."ORDER_ID")
 
Note
-----
   - cpu costing is off (consider enabling it)


--Sort Merge Joins

explain plan for
SELECT e.first_name,
  e.last_name,
  e.salary,
  d.department_name
FROM hr.employees e,
  hr.departments d
WHERE d.department_name IN ('Marketing', 'Sales')
AND e.department_id      >= d.department_id;


optimizer_features_enable: 11.2.0.1
Plan hash value: 1343509718
 
--------------------------------------------------------------------------------------------
| Id  | Operation                    | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |             |    47 |  1786 |     6  (17)| 00:00:01 |
|   1 |  MERGE JOIN                  |             |    47 |  1786 |     6  (17)| 00:00:01 |
|*  2 |   TABLE ACCESS BY INDEX ROWID| DEPARTMENTS |     2 |    32 |     2   (0)| 00:00:01 |
|   3 |    INDEX FULL SCAN           | DEPT_ID_PK  |    27 |       |     1   (0)| 00:00:01 |
|*  4 |   SORT JOIN                  |             |   107 |  2354 |     4  (25)| 00:00:01 |
|   5 |    TABLE ACCESS FULL         | EMPLOYEES   |   107 |  2354 |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter("D"."DEPARTMENT_NAME"='Marketing' OR "D"."DEPARTMENT_NAME"='Sales')
   4 - access("E"."DEPARTMENT_ID">="D"."DEPARTMENT_ID")
       filter("E"."DEPARTMENT_ID">="D"."DEPARTMENT_ID")


optimizer_features_enable: 10.2.0.1
Plan hash value: 1343509718
 
--------------------------------------------------------------------------------------------
| Id  | Operation                    | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |             |    11 |   418 |     6  (17)| 00:00:01 |
|   1 |  MERGE JOIN                  |             |    11 |   418 |     6  (17)| 00:00:01 |
|*  2 |   TABLE ACCESS BY INDEX ROWID| DEPARTMENTS |     2 |    32 |     2   (0)| 00:00:01 |
|   3 |    INDEX FULL SCAN           | DEPT_ID_PK  |    27 |       |     1   (0)| 00:00:01 |
|*  4 |   SORT JOIN                  |             |   107 |  2354 |     4  (25)| 00:00:01 |
|   5 |    TABLE ACCESS FULL         | EMPLOYEES   |   107 |  2354 |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter("D"."DEPARTMENT_NAME"='Marketing' OR "D"."DEPARTMENT_NAME"='Sales')
   4 - access("E"."DEPARTMENT_ID">="D"."DEPARTMENT_ID")
       filter("E"."DEPARTMENT_ID">="D"."DEPARTMENT_ID")



optimizer_features_enable: 9.2.0.8
Plan hash value: 2968905875
 
------------------------------------------------------------------
| Id  | Operation          | Name        | Rows  | Bytes | Cost  |
------------------------------------------------------------------
|   0 | SELECT STATEMENT   |             |    11 |   418 |     5 |
|   1 |  NESTED LOOPS      |             |    11 |   418 |     5 |
|*  2 |   TABLE ACCESS FULL| DEPARTMENTS |     2 |    32 |     2 |
|*  3 |   TABLE ACCESS FULL| EMPLOYEES   |     5 |   110 |     2 |
------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter("D"."DEPARTMENT_NAME"='Marketing' OR 
              "D"."DEPARTMENT_NAME"='Sales')
   3 - filter("E"."DEPARTMENT_ID">="D"."DEPARTMENT_ID")
 
Note
-----
   - cpu costing is off (consider enabling it)