BDA Lab Manual R22
Course Outcomes
1. Use Excel as an analytical and visualization tool.
2. Ability to program using Hadoop and MapReduce.
3. Ability to perform data analytics using ML in R.
4. Use Cassandra to perform social media analytics.
List of Experiments
1. Create a Hadoop cluster
2. Implement a simple MapReduce job that builds an inverted index on the set of input
documents (Hadoop)
3. Process big data in HBase
4. Store and retrieve data in Pig
5. Perform data analysis using MongoDB
6. Using Power Pivot (Excel), perform the following on any dataset
a) Big Data Analytics
b) Big Data Charting
7. Use R-Project to carry out statistical analysis of big data
8. Use R-Project for data visualization
TEXT BOOKS:
1. Big Data Analytics, Seema Acharya, Subhashini Chellappan, Wiley, 2015.
2. Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's
Businesses, Michael Minelli, Michele Chambers, Ambiga Dhiraj, 1st Edition, Wiley CIO
Series, 2013.
3. Hadoop: The Definitive Guide, Tom White, 3rd Edition, O'Reilly Media, 2012.
4. Big Data Analytics: Disruptive Technologies for Changing the Game, Arvind Sathi, 1st
Edition, IBM Corporation, 2012.
REFERENCES:
1. Big Data and Business Analytics, Jay Liebowitz, Auerbach Publications, CRC Press, 2013.
2. Using R to Unlock the Value of Big Data: Big Data Analytics with Oracle R Enterprise
and Oracle R Connector for Hadoop, Tom Plunkett, Mark Hornick, McGraw-Hill/Osborne
Media, Oracle Press, 2013.
3. Professional Hadoop Solutions, Boris Lublinsky, Kevin T. Smith, Alexey Yakubovich,
Wiley, ISBN: 9788126551071, 2015.
4. Understanding Big Data, Chris Eaton, Dirk deRoos et al., McGraw-Hill, 2012.
5. Intelligent Data Analysis, Michael Berthold, David J. Hand, Springer, 2007.
PROGRAM 1
INSTALLING HADOOP IN WINDOWS 10
Preparations
A. Make sure that you are using Windows 10 and are logged in as admin.
G. Run the Java installation file jdk-8u191-windows-x64. Install it directly in the folder C:\Java, or
move the items from the folder jdk1.8.0 to the folder C:\Java. It should look like this:
Open the Environment Variables dialog (System Properties → Environment Variables). A new
window will open with two tables and buttons. The upper table is for User variables
and the lower for System variables.
B. Make another new User variable [1]. Name it HADOOP_HOME and set it to the
hadoop-2.8.0 bin folder (C:\hadoop-2.8.0\bin) [2]. Click OK [3].
C. Now add Java and Hadoop to the System variables Path: go to Path [1] and click Edit [2]. The
editor window opens. Choose New [3] and add the address C:\Java\bin [4]. Choose New again
[5] and add the address C:\hadoop-2.8.0\bin [6]. Click OK [7] in the editor window and OK
[8] to change the System variables.
Configuration
A. Go to the file C:\Hadoop-2.8.0\etc\hadoop\core-site.xml [1]. Right-click on the file
and edit it with Notepad++ [2]. Paste the following code between the configuration tags and save:
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
B. Rename the file C:\Hadoop-2.8.0\etc\hadoop\mapred-site.xml.template to mapred-site.xml and
edit it with Notepad++. Paste the following code between the configuration tags and save:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
D. Under C:\Hadoop-2.8.0 create a folder named data [1] with two subfolders, “datanode” and
“namenode” [2].
E. Edit the file C:\Hadoop-2.8.0\etc\hadoop\hdfs-site.xml with Notepad++. Paste the following
code between the configuration tags and save:
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>C:\hadoop-2.8.0\data\namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>C:\hadoop-2.8.0\data\datanode</value>
</property>
F. Edit the file C:\Hadoop-2.8.0\etc\hadoop\yarn-site.xml with Notepad++. Paste the following
code between the configuration tags and save:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
B. Delete the bin folder C:\Hadoop-2.8.0\bin [1, 2] and replace it with the new
bin folder from Hadoop Configuration.zip [3].
Testing
A. Search for cmd [1] and open the Command Prompt [2]. Type
hdfs namenode -format [3] and press Enter.
If this first test works, the Command Prompt will print a lot of log output. That is a good sign!
B. Now you must change directory in the Command Prompt. Type cd C:\hadoop-2.8.0\sbin and
press Enter. In the sbin folder, type start-all.cmd and press Enter.
If the configuration is right, four daemon windows (namenode, datanode, resourcemanager,
and nodemanager) will start running, and it will look something like this:
C. Now open a browser, type localhost:8088 in the address field, and press Enter.
Can you see the little Hadoop elephant? Then you have done a really good job!
***********************
To close the running programs, run "stop-all.cmd" in the command prompt.
1. Write the steps for Hadoop installation in Windows 10
Preparations
A. Check that you are logged in as admin.
B. Download Java jdk1.8.0
C. Download Hadoop 2.8.0
D. Download Notepad++
E. Create a folder C:\Java
F. Install Java jdk1.8.0 in the folder C:\Java
G. Install Hadoop 2.8.0 right under C:\
H. Install Notepad++
I. If the Windows firewall is activated, open ports 8088 and 50070
Configuration
A. Edit the file C:\Hadoop-2.8.0\etc\hadoop\core-site.xml.
B. Rename C:\Hadoop-2.8.0\etc\hadoop\mapred-site.xml.template to mapred-site.xml and
edit it.
C. Under C:\Hadoop-2.8.0, create a folder named "data" with subfolders "datanode" and
"namenode".
D. Edit the file C:\Hadoop-2.8.0\etc\hadoop\hdfs-site.xml.
E. Edit the file C:\Hadoop-2.8.0\etc\hadoop\yarn-site.xml.
F. Edit the file C:\Hadoop-2.8.0\etc\hadoop\hadoop-env.cmd.
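Step F only needs to point Hadoop at the JDK. A minimal sketch of the relevant line in
hadoop-env.cmd, assuming Java was installed in C:\Java as in the Preparations:
set JAVA_HOME=C:\Java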
PROGRAM 2
Aim: To implement a simple word count MapReduce program with Hadoop using Java in
Windows 10
Prerequisites:
This program deals with a basic MapReduce program using the Apache Hadoop framework
on Windows computers. The Hadoop framework installed is Pseudo-Distributed (single node).
This tutorial deals with running the MapReduce program on Windows; the Hadoop single-node
framework and the JDK with Eclipse are already installed.
Java JDK version "1.8.0_291" and Hadoop version "3.3.0" are used here; the steps
are similar for other versions.
Create a text file with some content for the word count.
Procedure:
Initially, open the Eclipse IDE and create a new project. Here the project name is
map_reduce_example.
Now open Configure Build Path for your project and add the external Hadoop jar files
to the project as shown below.
After successfully adding the jar files, Eclipse will show the items under Referenced Libraries, as
shown below.
Next, add the source code for the word-count program.
Program:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
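Only the import section of the source code is reproduced above. For reference, a standard
completion of the word-count Mapper, Reducer, and driver, following the stock Apache Hadoop
WordCount example, is sketched below:

public class WordCount {

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Split each input line into tokens and emit a (word, 1) pair per token
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    // Sum the counts emitted for each word
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}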
After adding the source code, create a jar file of map_reduce_example using the Export option in
the Eclipse IDE.
7. Now run the map_reduce jar file exported previously using the Hadoop
command (hadoop jar jarpath/jar_name.jar /input_directory
/output_directory) --- leave a space between the input directory and output
directory as shown.
PROGRAM 3: Process big data in HBase
Theory:
HBase is an open-source, sorted-map data store built on top of Hadoop. It is column-oriented and
horizontally scalable.
It is based on Google's Bigtable. It has a set of tables which keep data in key-value format. HBase is
well suited for sparse data sets, which are very common in big data use cases. HBase provides APIs
enabling development in practically any programming language. It is a part of the Hadoop
ecosystem that provides random real-time read/write access to data in the Hadoop File System.
Limitations of an RDBMS that HBase addresses:
An RDBMS gets exponentially slow as the data becomes large
It expects data to be highly structured, i.e. to fit in a well-defined schema
Any change in schema might require downtime
For sparse datasets, there is too much overhead in maintaining NULL values
Features of HBase
Horizontally scalable: You can add any number of columns anytime.
Automatic failover: Automatic failover is a resource that allows a system administrator to
automatically switch data handling to a standby system in the event of system compromise.
Integration with the MapReduce framework: All the commands and Java code internally
implement MapReduce to do the task, and HBase is built over the Hadoop Distributed File
System.
HBase is a sparse, distributed, persistent, multidimensional sorted map, which is indexed by
row key, column key, and timestamp.
It is often referred to as a key-value store or column-family-oriented database, or as storing
versioned maps of maps.
Fundamentally, it is a platform for storing and retrieving data with random access.
It doesn't care about datatypes (you can store an integer in one row and a string in another
for the same column).
It doesn't enforce relationships within your data.
It is designed to run on a cluster of computers, built using commodity hardware.
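Since the theory above notes that HBase exposes programmatic APIs as well as the shell, here is a
minimal Java client sketch using the standard org.apache.hadoop.hbase.client classes. It targets
the newtbl table with column family knowledge created in the shell steps below; the row key and
cell values are made-up sample data:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWrite {
  public static void main(String[] args) throws IOException {
    // Picks up hbase-site.xml from the classpath
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("newtbl"))) {
      // Write one cell: row "r3", column knowledge:sports
      Put put = new Put(Bytes.toBytes("r3"));
      put.addColumn(Bytes.toBytes("knowledge"), Bytes.toBytes("sports"),
          Bytes.toBytes("cricket"));
      table.put(put);
      // Read the same cell back and print it
      Result result = table.get(new Get(Bytes.toBytes("r3")));
      byte[] value = result.getValue(Bytes.toBytes("knowledge"), Bytes.toBytes("sports"));
      System.out.println(Bytes.toString(value));
    }
  }
}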
HBase commands
Step 1: First go to the terminal and type StartCDH.sh
Step 2: Next, type the jps command in the terminal to verify that the daemons are running
Step 5: In the HBase shell (started with the hbase shell command), check the version:
hbase(main):001:0> version
hbase(main):011:0> create 'newtbl','knowledge'
hbase(main):011:0> describe 'newtbl'
hbase(main):011:0> status
1 servers, 0 dead, 15.0000 average load
hbase(main):011:0> scan 'newtbl'
ROW COLUMN+CELL
r1 column=knowledge:..., timestamp=..., value=macroeconomics
r2 column=knowledge:music, timestamp=1678807877340, value=songs
2 row(s) in 0.0250 seconds
Verification
After disabling a table, you can still sense its existence through the list and exists
commands, but you cannot scan it. Scanning will give you the following error:
hbase(main):028:0> scan 'newtbl'
ROW COLUMN + CELL
ERROR: newtbl is disabled.
is_disabled
This command is used to find whether a table is disabled. Its syntax is as follows.
hbase> is_disabled 'table name'
disable_all
This command is used to disable all the tables matching the given regex. The syntax
for disable_all command is given below.
hbase> disable_all 'r.*'
Suppose there are 5 tables in HBase, namely raja, rajani, rajendra, rajesh, and raju. The
following code will disable all the tables starting with raj.
hbase(main):002:07> disable_all 'raj.*'
raja
rajani
rajendra
rajesh
raju
Disable the above 5 tables (y/n)?
y
5 tables successfully disabled
Verification
After enabling the table, scan it. If you can see the schema, your table is successfully
enabled.
is_enabled
This command is used to find whether a table is enabled. Its syntax is as follows:
hbase> is_enabled 'table name'
The following code verifies whether the table named newtbl is enabled. If it is enabled, it
will return true and if not, it will return false.
hbase(main):031:0> is_enabled 'newtbl'
true
0 row(s) in 0.0440 seconds
describe
This command returns the description of the table. Its syntax is as follows:
hbase> describe 'table name'
PROGRAM 4: Store and retrieve data in Pig
Operator Description
LOAD To load data from the file system (local/HDFS) into a relation.
STORE To save a relation to the file system (local/HDFS).
Other operator categories include Filtering, Sorting, and Diagnostic Operators.
Employee ID Name Age City
001 Angelina 22 LosAngeles
002 Jackie 23 Beijing
003 Deepika 22 Mumbai
004 Pawan 24 Hyderabad
005 Rajani 21 Chennai
006 Amitabh 22 Mumbai
Step-1: Create a directory in HDFS with the name pigdir in the required path using mkdir:
$ hdfs dfs -mkdir /bdalab/pigdir
Step-2: The input file of Pig contains each tuple/record on an individual line, with the entities
separated by a delimiter (",").
In the local file system, create an input file student_data.txt containing the data shown below:
001,Jagruthi,21,Hyderabad,9.1
002,Praneeth,22,Chennai,8.6
003,Sujith,22,Mumbai,7.8
004,Sreeja,21,Bengaluru,9.2
005,Mahesh,24,Hyderabad,8.8
006,Rohit,22,Chennai,7.8
007,Sindhu,23,Mumbai,8.3
Likewise, create an input file employee_data.txt containing the data shown below:
001,Angelina,22,LosAngeles
002,Jackie,23,Beijing
003,Deepika,22,Mumbai
004,Pawan,24,Hyderabad
005,Rajani,21,Chennai
006,Amitabh,22,Mumbai
Step-3: Move the file from the local file system to HDFS using the put (or) copyFromLocal
command and verify using the -cat command.
To get the path of the file student_data.txt, type the below
command: readlink -f student_data.txt
$ hdfs dfs -put /home/hadoop/Desktop/student_data.txt /bdalab/pigdir/
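Step-4: Apply Relational Operator – LOAD to read the files into the relations student and
employee used in the later steps. A sketch (the field names and types are assumed from the
sample data above):
grunt> student = LOAD '/bdalab/pigdir/student_data.txt' USING PigStorage(',')
AS (id:int, name:chararray, age:int, city:chararray, cgpa:double);
grunt> employee = LOAD '/bdalab/pigdir/employee_data.txt' USING PigStorage(',')
AS (id:int, name:chararray, age:int, city:chararray);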
Step-5: Apply Relational Operator – STORE to store the relation in the HDFS directory
"/pig_output/" as shown below (each relation needs its own output directory):
grunt> STORE student INTO '/bdalab/pigdir/pig_output/' USING PigStorage(',');
grunt> STORE employee INTO '/bdalab/pigdir/pig_output_emp/' USING PigStorage(',');
Step-7: Apply Relational Operator – Diagnostic Operator – DUMP to print the contents of
the relation on the console:
grunt> Dump student
grunt> Dump employee
Step-8: Apply Relational Operator – Diagnostic Operator – EXPLAIN to display the logical,
physical, and MapReduce execution plans of a relation using the Explain operator:
grunt> Explain student
grunt> Explain employee
Step-9: Apply Relational Operator – Diagnostic Operator – ILLUSTRATE to give the step-
by-step execution of a sequence of statements:
grunt> Illustrate student
grunt> Illustrate employee
PROGRAM 5:
Perform data analysis using MongoDB
MongoDB:
MongoDB is a source-available, cross-platform, document-oriented
database program. Classified as a NoSQL database product, MongoDB uses JSON-like
documents with optional schemas.
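For the job below, assume each document in the sales collection has roughly the shape sketched
here (the field names item, price, and quantity are assumptions that the code relies on):
{ "item": "pen", "price": 10, "quantity": 5 }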
3. MapReduce Code
Below is the Java MapReduce code to calculate the total revenue for each item.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
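Only the imports and the tail of the driver survive above. A sketch of how the complete job would
plausibly read, using the mongo-hadoop connector's standard mongo.input.uri / mongo.output.uri
settings (the database name test and the document field names are assumptions):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.bson.BSONObject;
import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;

public class SalesRevenue {

  public static class SalesMapper extends Mapper<Object, BSONObject, Text, IntWritable> {
    // Each input value is one document from the sales collection
    public void map(Object key, BSONObject value, Context context)
        throws IOException, InterruptedException {
      String item = value.get("item").toString();                 // assumed field
      int price = ((Number) value.get("price")).intValue();       // assumed field
      int quantity = ((Number) value.get("quantity")).intValue(); // assumed field
      context.write(new Text(item), new IntWritable(price * quantity));
    }
  }

  public static class SalesReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Sum the per-record revenues for each item
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int total = 0;
      for (IntWritable v : values) total += v.get();
      context.write(key, new IntWritable(total));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("mongo.input.uri", "mongodb://localhost:27017/test.sales");   // assumed database
    conf.set("mongo.output.uri", "mongodb://localhost:27017/test.result"); // assumed database
    Job job = Job.getInstance(conf, "sales revenue");
    job.setJarByClass(SalesRevenue.class);
    job.setMapperClass(SalesMapper.class);
    job.setReducerClass(SalesReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setInputFormatClass(MongoInputFormat.class);
    job.setOutputFormatClass(MongoOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}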
How It Works
1. MongoInputFormat: Reads documents from the MongoDB collection (sales).
2. SalesMapper: Computes revenue for each item.
3. SalesReducer: Aggregates the revenue for each item across all records.
4. MongoOutputFormat: Saves the processed results back into a MongoDB collection
(result).
Aim: To perform big data analytics using Power Pivot in Excel
Theory: Power Pivot is an Excel add-in you can use to perform powerful data analysis and create
sophisticated data models. With Power Pivot, you can mash up large volumes of data from
various sources, perform information analysis rapidly, and share insights easily.
In both Excel and in Power Pivot, you can create a Data Model, a collection of tables with
relationships. The data model you see in a workbook in Excel is the same data model you see in
the Power Pivot window. Any data you import into Excel is available in Power Pivot, and vice
versa.
Procedure:
Open Microsoft Excel, go to the Data menu, and click Get Data
Next, click Only Create Connection and tick the check box Add this data to the Data Model
Next, click Manage Data Model and see that all the
Twitter data is loaded as a model, then close the Power
Pivot window.
Click the Diagram View and set up the relationships between the tables
Experiment 6: Using Power Pivot perform the following on any data set
Aim: To create a variety of charts using Excel for the given data
Resources: Microsoft Excel
Theory:
When your data sets are big, you can use Excel Power Pivot, which can handle
hundreds of millions of rows of data. The data can be in external data
sources, and Excel Power Pivot builds a Data Model that works in a memory-
optimized mode. You can perform the calculations, analyze the data, and
arrive at a report to draw conclusions and decisions. The report can be either
a Power PivotTable or a Power PivotChart, or a combination of both.
You can utilize Power Pivot as an ad hoc reporting and analytics solution.
Thus, a person with hands-on experience with Excel can perform high-end
data analysis and decision making in a matter of minutes, and the results are
a great asset to be included in dashboards.
Click the OK button. A new worksheet gets created in the Excel window and an empty Power
PivotTable appears.
As you can observe, the layout of the Power PivotTable is similar to that of a PivotTable.
The PivotTable Fields List appears on the right side of the worksheet. Here, you will find
some differences from a PivotTable. The Power PivotTable Fields list has two tabs − ACTIVE
and ALL − that appear below the title and above the fields list. The ALL tab is highlighted. The
ALL tab displays all the data tables in the Data Model, and the ACTIVE tab displays all the data
tables that are chosen for the Power PivotTable at hand.
Click the table names in the PivotTable Fields list under ALL. The corresponding fields with
check boxes will appear.
Each table name will have the symbol on the left side.
If you place the cursor on this symbol, the Data Source and the Model Table Name of
that data table will be displayed.
The field Salesperson appears in the Power PivotTable, and the table Salesperson appears
under the ACTIVE tab.
Click the OK button. Sort the column labels in ascending order.
Click on the Home tab on the Ribbon in the Power Pivot window.
Click on PivotTable.
Click on PivotChart in the dropdown list.
Click the OK button. An empty PivotChart gets created on a new worksheet in the
Excel window. In this chapter, when we say PivotChart, we are referring to Power
PivotChart.
As you can observe, all the tables in the data model are displayed in the PivotChart Fields list.
Note that the display of Field Buttons and/or the Legend depends on the context of the
PivotChart; some of them may initially be hidden. You need to decide what is required to be
displayed.
As in the case of Power PivotTable, Power PivotChart Fields list also contains two tabs −
ACTIVE and ALL. Further, there are 4 areas −
AXIS (Categories)
LEGEND (Series)
∑ VALUES
FILTERS
As you can observe, the Legend gets populated with ∑ Values. Further, Field Buttons get added
to the PivotChart for ease of filtering the data that is being displayed. You can click on
the arrow on a Field Button and select/deselect values to be displayed in the Power
PivotChart.
You can have the following Table and Chart Combinations in Power Pivot.
Chart and Table (Horizontal) − you can create a Power PivotChart and a Power
PivotTable, one next to the other horizontally, in the same worksheet.
Chart and Table (Vertical) − you can create a Power PivotChart and a Power PivotTable, one
below the other vertically, in the same worksheet.
These combinations and some more are available in the dropdown list that appears when
you click on PivotTable on the Ribbon in the Power Pivot window.
Click on the PivotChart and you can develop a wide variety of charts.
Output:
PROGRAM 7
Use R-Project to carry out statistical analysis of big data
Procedure:
Step 1: Go to https://posit.co/download/rstudio-desktop/#download
Step 2: wget -c
https://download1.rstudio.org/desktop/jammy/amd64/rstudio-2022.07.2-576-amd64.deb
Step 3: Install the downloaded package (for example, sudo apt install ./rstudio-2022.07.2-576-amd64.deb)
Step 4: rstudio
This launches RStudio.
Procedure:
-->install.packages("gapminder")
-->library(gapminder)
-->data(gapminder)
output:
A tibble: 1,704 × 6
-->summary(gapminder)
output:
(Other) 1632
-->x<-mean(gapminder$gdpPercap)
-->x
output:[1] 7215.327
-->attach(gapminder)
-->median(pop)
output:[1] 7023596
-->hist(lifeExp)
-->boxplot(lifeExp)
The commands above will plot the images below.
-->plot(lifeExp ~ gdpPercap)
-->install.packages("dplyr")
-->gapminder %>%
+ filter(year == 2007) %>%
+ group_by(continent) %>%
+ summarise(lifeExp = median(lifeExp))
output:
# A tibble: 5 × 2
continent lifeExp
<fct> <dbl>
1 Africa 52.9
2 Americas 72.9
3 Asia 72.4
4 Europe 78.6
5 Oceania 80.7
-->install.packages("ggplot2")
--> library("ggplot2")
-->ggplot(gapminder, aes(x = continent, y = lifeExp)) +
geom_boxplot(outlier.colour = "hotpink") +
geom_jitter(position = position_jitter(width = 0.1, height = 0), alpha = 1/4)
output:
-->head(country_colors, 4)
output:
Nigeria Egypt Ethiopia
"#7F3B08" "#833D07" "#873F07"
Congo, Dem. Rep.
"#8B4107"
-->head(continent_colors)
mtcars
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
> dim(Data_Cars)
[1] 32 11
> names(Data_Cars)
[1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "car
b"
> Data_Cars <- mtcars
> Data_Cars$cyl
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
> Data_Cars <- mtcars
> sort(Data_Cars$cyl)
[1] 4 4 4 4 4 4 4 4 4 4 4 6 6 6 6 6 6 6 8 8 8 8 8 8 8 8 8 8 8 8 8 8
> Data_Cars <- mtcars
>
> summary(Data_Cars)
> mean(Data_Cars$wt)
[1] 3.21725
> median(Data_Cars$wt)
[1] 3.325
Data_Cars <- mtcars
names(sort(-table(Data_Cars$wt)))[1]
# Values of height:
151, 174, 138, 186, 128, 136, 179, 163, 152, 131
# Values of weight:
63, 81, 56, 91, 47, 57, 76, 72, 62, 48
lm() Function
This function creates the relationship model between the predictor and the response variable.
Syntax
The basic syntax for lm() function in linear regression is −
lm(formula,data)
Following is the description of the parameters used −
formula is a symbol presenting the relation between x and y.
data is the vector on which the formula will be applied.
Create Relationship Model & get the Coefficient
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y ~ x)
print(relation)
Result:
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
-38.4551 0.6746
print(summary(relation))
Result:
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-6.3002 -1.6629 0.0412 1.8944 3.9775
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -38.45509 8.04901 -4.778 0.00139 **
x 0.67461 0.05191 12.997 1.16e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
predict() Function
Syntax
The basic syntax for predict() in linear regression is −
predict(object, newdata)
Following is the description of the parameters used −
object is the model which has already been created using the lm() function.
newdata is the vector containing the new value for the predictor variable.
Predict the weight of a new person with height 170:
a <- data.frame(x = 170)
result <- predict(relation, a)
print(result)
Result:
1
76.22869
PROGRAM 8: Use R-Project for data visualization
Bar Plot
There are two types of bar plots, horizontal and vertical, which represent data points as
horizontal or vertical bars of certain lengths proportional to the value of the data item. They
are generally used for plotting continuous and categorical variables. By setting
the horiz parameter to TRUE or FALSE, we can get horizontal or vertical bar plots
respectively, as in the sketch below.
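A minimal sketch using the built-in mtcars data set shown earlier (the choice of column is
illustrative):
# Count the cars in each cylinder class, then draw vertical and horizontal bar plots
counts <- table(mtcars$cyl)
barplot(counts, main = "Cars by cylinder count", xlab = "Cylinders", ylab = "Count")
barplot(counts, horiz = TRUE)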
Histogram
A histogram is like a bar chart as it uses bars of varying height to represent data distribution.
However, in a histogram values are grouped into consecutive intervals called bins. In a
Histogram, continuous values are grouped and displayed in these bins whose size can be
varied.
For a histogram, the parameter xlim can be used to specify the interval within which all
values are to be displayed.
Another parameter, freq, when set to TRUE denotes the frequency of the various values in the
histogram, and when set to FALSE represents probability densities on the y-axis, such that the
total area of the histogram adds up to one. A sketch follows the scenario list below.
Histograms are used in the following scenarios:
To verify an equal and symmetric distribution of the data.
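A minimal sketch on mtcars (the xlim interval is illustrative):
# xlim fixes the displayed interval; freq = FALSE switches the y-axis to densities
hist(mtcars$mpg, xlim = c(10, 35), main = "MPG distribution", xlab = "mpg")
hist(mtcars$mpg, freq = FALSE)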
Box Plot
The statistical summary of the given data is presented graphically using a boxplot. A boxplot
depicts information like the minimum and maximum data points, the median value, the first and
third quartiles, and the interquartile range. A sketch follows the list below.
Box plots are used in the following scenarios:
To give a comprehensive statistical description of the data through a visual cue.
To identify the outlier points that do not lie in the inter-quartile range of data.
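A one-line sketch on mtcars:
boxplot(mtcars$mpg, main = "MPG box plot", ylab = "mpg")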
Scatter Plot
A scatter plot is composed of many points on a Cartesian plane. Each point denotes the value
taken by two parameters and helps us easily identify the relationship between them. A sketch
follows the list below.
Scatter Plots are used in the following scenarios:
To show whether an association exists between bivariate data.
To measure the strength and direction of such a relationship.
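A minimal sketch on mtcars (wt is weight in 1000 lbs):
plot(mtcars$wt, mtcars$mpg, xlab = "Weight (1000 lbs)", ylab = "mpg", main = "Weight vs mileage")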
Heat Map
A heatmap is a graphical representation of data that uses colors to visualize the values
of a matrix. The heatmap() function is used to plot a heatmap.
Syntax: heatmap(data)
Parameters: data: It represents the matrix data, i.e. the values of the rows and columns.
Return: This function draws a heatmap.
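A minimal sketch on mtcars; heatmap() needs a numeric matrix, and scale = "column"
normalizes each column before coloring:
m <- as.matrix(mtcars)
heatmap(m, scale = "column")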
Procedure:
Step 1: Facebook Developer Registration
Go to https://developers.facebook.com and register yourself by
clicking on the Get Started button at the top right of the page (see the
snapshot below). It will then open a registration form which
you need to fill in to get yourself registered.
Step 2: Click on Tools and install the required packages:
install.packages("httpuv")
install.packages("Rfacebook")
install.packages("RColorBrewer")
install.packages("RCurl")
install.packages("rjson")
install.packages("httr")
library(Rfacebook)
library(httpuv)
library(RColorBrewer)
access_token <- "YOUR_ACCESS_TOKEN" # paste your own token from the Graph API Explorer
options(RCurlOptions = list(verbose = FALSE, capath = system.file("CurlSSL", "cacert.pem",
package = "RCurl"), ssl.verifypeer = FALSE))
me <- getUsers("me", token = access_token)
View(me)
myFriends <- getFriends(access_token, simplify = FALSE)
table(myFriends$gender)
pie(table(myFriends$gender))
Output: