EXPERIMENT: WORD COUNT IN HADOOP (USING HDFS)
1. Create a Java Project in Eclipse
Open Eclipse IDE
Go to File → New → Java Project
Name the project: wordCountJob
2. Add Classes to the Project
Right-click on wordCountJob → New → Class
Create the following three classes:
i. WordCount.java
ii. WordMapper.java
iii. WordReducer.java
3. Add Code to Each File
Prog 1: WordCount.java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.printf("Usage: WordCount <input dir> <output dir>\n");
            System.exit(-1);
        }
        Job job = new Job();
        job.setJarByClass(WordCount.class);
        job.setJobName("wordCount");
        // Input and output locations are taken from the command line (HDFS paths).
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Wire up the mapper/reducer classes and their key/value types.
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(WordReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        boolean success = job.waitForCompletion(true);
        System.exit(success ? 0 : 1);
    }
}
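Note: this driver matches the Hadoop 1.x setup used here (hadoop-core-1.2.1); on Hadoop 2.x and later the no-argument Job constructor is deprecated and Job.getInstance(new Configuration(), "wordCount") is used instead. Because WordReducer only sums counts, it could optionally also be registered as a combiner so partial sums are computed on the map side; this tweak is not used in this run, which is why the Combine counters in the job log below are 0:
job.setCombinerClass(WordReducer.class);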
Prog 2: WordMapper.java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Split on runs of non-word characters and emit (word, 1) for every token;
        // lower-casing makes the counts case-insensitive, matching the output in step 9.
        for (String word : line.split("\\W+")) {
            if (word.length() > 0) {
                context.write(new Text(word.toLowerCase()), new IntWritable(1));
            }
        }
    }
}
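To see what the mapper emits without starting a cluster, the tokenization logic can be exercised in a plain Java main. This is a standalone sketch for illustration only (TokenizeDemo is not part of the project):
public class TokenizeDemo {
    public static void main(String[] args) {
        String line = "I am from BE COMPS TSEC";   // one line of sample.txt
        // Same split as WordMapper: break the line on runs of non-word characters.
        for (String word : line.split("\\W+")) {
            if (word.length() > 0) {
                // WordMapper would emit the pair (word, 1) for each of these tokens.
                System.out.println(word.toLowerCase() + "\t1");
            }
        }
    }
}
Running it prints i, am, from, be, comps, and tsec, each paired with a count of 1; the real mapper writes the same pairs to the MapReduce framework via context.write.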
Prog 3: WordReducer.java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the 1s emitted by the mappers for this word.
        int wordCount = 0;
        for (IntWritable value : values) {
            wordCount += value.get();
        }
        context.write(key, new IntWritable(wordCount));
    }
}
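After the shuffle phase, all values for a given key arrive at the reducer together. For the sample file the mappers emit (am, 1) twice, so the reducer receives am with the value list [1, 1] and writes am 2, which is exactly the line that appears in the final output in step 9.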
4. Add External Hadoop Libraries
Right-click on wordCountJob → Build Path → Configure Build Path
Go to Libraries → Add External JARs
Navigate to:
/usr/lib/hadoop/
Add the core Hadoop JAR (e.g., hadoop-core-1.2.1.jar, or the equivalent for your Hadoop version)
Also add the JARs under /usr/lib/hadoop/lib/ if the compiler still reports missing classes
5. Export Project to JAR File
Right-click on the project → Export
Choose Java → JAR File
Click Next
Choose an export destination (e.g., the Desktop or the project's src folder)
Name the file: wordCount.jar
Click Finish
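If Eclipse is unavailable, the same JAR can typically be produced from the terminal instead, assuming the hadoop classpath command is available on the machine (a sketch, run from the directory holding the three .java files):
mkdir classes
javac -classpath "$(hadoop classpath)" -d classes WordCount.java WordMapper.java WordReducer.java
jar -cvf wordCount.jar -C classes .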
6. Copy JAR and Sample File to Workspace (Linux Terminal)
cd ~/workspace/wordCountJob/src
cp /path/to/wordCount.jar ./
# Create sample.txt with the test input:
cat > sample.txt << 'EOF'
Hello world
I am eshaan vaswani
I am from BE COMPS TSEC
EOF
7. Upload Input File to HDFS
hadoop fs -mkdir -p /user/training/hadoop_eshaan
hadoop fs -put sample.txt /user/training/hadoop_eshaan
hadoop fs -ls /user/training/hadoop_eshaan
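To double-check the upload, the file can be printed straight back from HDFS:
hadoop fs -cat /user/training/hadoop_eshaan/sample.txt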
8. Run the JAR File using Hadoop
[training@localhost src]$ hadoop jar wordCount.jar WordCount /user/training/hadoop_eshaan/sample.txt output2
25/08/04 23:20:45 WARN mapred.JobClient: Use GenericOptionsParser for parsing the
arguments. Applications should implement Tool for the same.
25/08/04 23:20:45 INFO input.FileInputFormat: Total input paths to process : 1
25/08/04 23:20:45 WARN snappy.LoadSnappy: Snappy native library is available
25/08/04 23:20:45 INFO util.NativeCodeLoader: Loaded the native-hadoop library
25/08/04 23:20:45 INFO snappy.LoadSnappy: Snappy native library loaded
25/08/04 23:20:45 INFO mapred.JobClient: Running job: job_202508010234_0006
25/08/04 23:20:46 INFO mapred.JobClient: map 0% reduce 0%
25/08/04 23:20:48 INFO mapred.JobClient: map 100% reduce 0%
25/08/04 23:20:55 INFO mapred.JobClient: map 100% reduce 100%
25/08/04 23:20:55 INFO mapred.JobClient: Job complete: job_202508010234_0006
25/08/04 23:20:55 INFO mapred.JobClient: Counters: 22
25/08/04 23:20:55 INFO mapred.JobClient: Job Counters
25/08/04 23:20:55 INFO mapred.JobClient: Launched reduce tasks=1
25/08/04 23:20:55 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=1318
25/08/04 23:20:55 INFO mapred.JobClient: Total time spent by all reduces waiting after
reserving slots (ms)=0
25/08/04 23:20:55 INFO mapred.JobClient: Total time spent by all maps waiting after
reserving slots (ms)=0
25/08/04 23:20:55 INFO mapred.JobClient: Launched map tasks=1
25/08/04 23:20:55 INFO mapred.JobClient: Data-local map tasks=1
25/08/04 23:20:55 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=7519
25/08/04 23:20:55 INFO mapred.JobClient: FileSystemCounters
25/08/04 23:20:55 INFO mapred.JobClient: FILE_BYTES_READ=134
25/08/04 23:20:55 INFO mapred.JobClient: HDFS_BYTES_READ=176
25/08/04 23:20:55 INFO mapred.JobClient: FILE_BYTES_WRITTEN=110360
25/08/04 23:20:55 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=71
25/08/04 23:20:55 INFO mapred.JobClient: Map-Reduce Framework
25/08/04 23:20:55 INFO mapred.JobClient: Reduce input groups=10
25/08/04 23:20:55 INFO mapred.JobClient: Combine output records=0
25/08/04 23:20:55 INFO mapred.JobClient: Map input records=3
25/08/04 23:20:55 INFO mapred.JobClient: Reduce shuffle bytes=134
25/08/04 23:20:55 INFO mapred.JobClient: Reduce output records=10
25/08/04 23:20:55 INFO mapred.JobClient: Spilled Records=24
25/08/04 23:20:55 INFO mapred.JobClient: Map output bytes=104
25/08/04 23:20:55 INFO mapred.JobClient: Combine input records=0
25/08/04 23:20:55 INFO mapred.JobClient: Map output records=12
25/08/04 23:20:55 INFO mapred.JobClient: SPLIT_RAW_BYTES=120
25/08/04 23:20:55 INFO mapred.JobClient: Reduce input records=12
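The WARN line at the top of the log is only advisory: Hadoop recommends that drivers implement the Tool interface so that generic options such as -D, -files, and -libjars are parsed automatically. A possible Tool-based variant of the driver is sketched below; the class name WordCountTool is illustrative, and this change is not required for the experiment:
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountTool extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: WordCountTool <input dir> <output dir>");
            return -1;
        }
        Job job = new Job(getConf());              // reuse the configuration parsed by ToolRunner
        job.setJarByClass(WordCountTool.class);
        job.setJobName("wordCount");
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(WordReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new WordCountTool(), args));
    }
}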
9. View Output
[training@localhost src]$ hadoop fs -ls output2
Found 3 items
-rw-r--r-- 1 training supergroup 0 2025-08-04 23:20 /user/training/output2/_SUCCESS
drwxr-xr-x - training supergroup 0 2025-08-04 23:20 /user/training/output2/_logs
-rw-r--r-- 1 training supergroup 71 2025-08-04 23:20 /user/training/output2/part-r-00000
[training@localhost src]$ hadoop fs -cat output2/part-r-00000
am 2
be 1
comps 1
eshaan 1
from 1
hello 1
i 2
tsec 1
vaswani 1
world 1
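Note: MapReduce refuses to write into an output directory that already exists, so to re-run the job either pass a new directory name or delete the old one first, e.g. hadoop fs -rmr output2 on this Hadoop 1.x setup (hadoop fs -rm -r output2 on Hadoop 2.x and later).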