Word counting source program based on MapReduce framework

CK1820 
Created at Feb 28, 2012 09:09:34
Updated at Dec 17, 2023 14:31:35 

The MapReduce framework operates exclusively on <key, value> pairs, that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types. 

Below is a simple application that counts the number of occurrences of each word in a given input set. It works with a local-standalone, pseudo-distributed or fully-distributed Hadoop installation. 

package org.myorg;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {
  // Mapper: emits <word, 1> for every whitespace-separated token of an input line.
  public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  // Reducer: sums the counts collected for each word. main() also registers it as the combiner.
  public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}
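
The map() method above splits each line with StringTokenizer and its default delimiters, so words are separated on whitespace only. The following is a minimal stand-alone sketch (plain Java, no Hadoop classes; the class name TokenizeSketch is made up for illustration) showing what that means in practice: case and attached punctuation are preserved, so "Hello," and "hello" would be counted as different words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper class, not part of the WordCount job itself.
final class TokenizeSketch {
    // Splits a line the same way map() does: StringTokenizer with its
    // default delimiters (space, tab, newline, carriage return, form feed).
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            out.add(tokenizer.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("Hello World Bye World")); // [Hello, World, Bye, World]
        System.out.println(tokens("Hello, hello\tWORLD"));   // [Hello,, hello, WORLD]
    }
}
```

If case-insensitive counts or punctuation stripping are wanted, the tokens would have to be normalized before word.set(...) is called; the program above does not do this.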

 

Usage 

Assuming HADOOP_HOME is the root of the installation and HADOOP_VERSION is the Hadoop version installed, compile WordCount.java and create a jar: 

$ mkdir wordcount_classes
$ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d wordcount_classes WordCount.java
$ jar -cvf /usr/joe/wordcount.jar -C wordcount_classes/ .

 

Assuming that: 

  • /usr/joe/wordcount/input - input directory in HDFS
  • /usr/joe/wordcount/output - output directory in HDFS


Sample text-files as input: 

$ bin/hadoop dfs -ls /usr/joe/wordcount/input/
/usr/joe/wordcount/input/file01
/usr/joe/wordcount/input/file02

$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file01
Hello World Bye World

$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file02
Hello Hadoop Goodbye Hadoop

Run the application: 

$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output

Output: 

$ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
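
The output above can be traced by hand without a cluster. The sketch below is an in-memory model of the three steps (map, group by key, reduce) in plain Java, fed with the contents of file01 and file02. It is only an illustration of the <key, value> flow, not how Hadoop is implemented: the class LocalWordCount and its methods are invented for this example, and a TreeMap stands in for the framework's sorting of keys.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical in-memory model of the job; no Hadoop classes are used.
final class LocalWordCount {
    // Map step: one input line -> a list of <word, 1> pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return pairs;
    }

    // Grouping step: collect all values per key; TreeMap keeps the keys sorted.
    static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step: sum the values collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs map, group and reduce over all input lines.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : group(mapped).entrySet()) {
            counts.put(entry.getKey(), reduce(entry.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "Hello World Bye World",         // contents of file01
                "Hello Hadoop Goodbye Hadoop");  // contents of file02
        for (Map.Entry<String, Integer> entry : run(input).entrySet()) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
        // Prints:
        // Bye 1
        // Goodbye 1
        // Hadoop 2
        // Hello 2
        // World 2
    }
}
```

Because summing is associative, the same reduce logic can be applied to partial results first, which is why the real job can register Reduce as its combiner as well.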

Applications can use the -files option to specify a comma-separated list of paths that will be present in the current working directory of the task. The -libjars option allows applications to add jars to the classpaths of the maps and reduces. The -archives option allows them to pass a comma-separated list of archives as arguments. These archives are unarchived, and a link with the name of the archive is created in the current working directory of tasks. More details about the command-line options are available in the Commands Guide. 

Running wordcount example with -libjars, -files and -archives: 

$ hadoop jar hadoop-examples.jar wordcount -files cachefile.txt -libjars mylib.jar -archives myarchive.zip input output

Here, myarchive.zip will be placed and unzipped into a directory by the name "myarchive.zip". 
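
Since a file passed with -files shows up in the task's current working directory, task code can open it by its plain name. The following is a rough sketch in plain Java of what such a read could look like. The class CacheFileSketch, the loadEntries helper, and the idea that cachefile.txt holds one entry per line (for example, words to skip) are all assumptions made for illustration; the WordCount program above does not read this file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper; it only shows reading a file by relative name.
final class CacheFileSketch {
    // Loads one entry per non-empty line from the named file in the given directory.
    static Set<String> loadEntries(Path dir, String name) throws IOException {
        Set<String> entries = new HashSet<>();
        for (String line : Files.readAllLines(dir.resolve(name), StandardCharsets.UTF_8)) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                entries.add(trimmed);
            }
        }
        return entries;
    }

    public static void main(String[] args) throws IOException {
        // Inside a task, the working directory is where cachefile.txt would be linked.
        Path workingDir = Paths.get("");
        if (Files.exists(workingDir.resolve("cachefile.txt"))) {
            System.out.println(loadEntries(workingDir, "cachefile.txt").size() + " entries loaded");
        } else {
            System.out.println("cachefile.txt not found in the working directory");
        }
    }
}
```

In a real job such a file would typically be loaded once per task, before any records are processed, rather than on every call to map().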

Users can specify a different symbolic name for files and archives passed through -files and -archives option, using #. 

For example: 

$ hadoop jar hadoop-examples.jar wordcount -files dir1/dict.txt#dict1,dir2/dict.txt#dict2 -archives mytar.tgz#tgzdir input output

Here, the files dir1/dict.txt and dir2/dict.txt can be accessed by tasks using the symbolic names dict1 and dict2 respectively. The archive mytar.tgz will be placed and unarchived into a directory by the name "tgzdir". 
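
The naming rule can be modeled in a few lines: the part after # is the symbolic name, and without # the entry keeps its own file name. The sketch below uses java.net.URI, which treats the text after # as the fragment. It is only a model of the rule as described here, not Hadoop's own option parsing, and the class SymlinkNameSketch with its two methods is invented for illustration.

```java
import java.net.URI;

// Hypothetical model of the "path#name" convention; not Hadoop code.
final class SymlinkNameSketch {
    // The path part of the spec, i.e. everything before '#'.
    static String sourcePath(String spec) {
        return URI.create(spec).getPath();
    }

    // The name the entry gets in the task's working directory:
    // the text after '#', or the last path component when no '#' is given.
    static String linkName(String spec) {
        URI uri = URI.create(spec);
        if (uri.getFragment() != null) {
            return uri.getFragment();
        }
        String path = uri.getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        for (String spec : "dir1/dict.txt#dict1,dir2/dict.txt#dict2".split(",")) {
            System.out.println(sourcePath(spec) + " -> " + linkName(spec));
        }
        // Prints:
        // dir1/dict.txt -> dict1
        // dir2/dict.txt -> dict2
        System.out.println(linkName("mytar.tgz#tgzdir")); // tgzdir
        System.out.println(linkName("myarchive.zip"));    // myarchive.zip
    }
}
```

Giving the two dict.txt files different symbolic names is what keeps them from colliding in the working directory, since both have the same file name.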

Reference 
http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Inputs+and+Outputs


