Unix Scripting: A Tutorial For Computational Linguistics (CSE 506/606)

This document is an overview and tutorial on Unix scripting tools for computational linguistics and natural language processing. It covers the basics of sed, awk, bash, sort, and other common Unix tools (Python is mentioned only briefly), explains when and how to use each one, and works through examples on typical NLP data such as parsed, part-of-speech-tagged sentences. It concludes by combining several tools into a single script that extracts and processes text from files.

Unix Scripting

A Tutorial for Computational Linguistics

(CSE 506/606)

Kristy Hollingshead
Fall 2009

www.cslu.ogi.edu/~hollingk/CL_tutorial.html

Overview
• The goal here is to make your lives easier!
• CL & NLP are very text-intensive
• Simple tools for text-manipulation
– sed
– awk
– python
– bash/tcsh
– sort
• When & how to use each of these tools

Regular expressions crash course
• [a-z] exactly one lowercase letter
• [a-z]* zero or more lowercase letters
• [a-z]+ one or more lowercase letters
• [a-zA-Z0-9] one lowercase or uppercase letter,
or a digit
• [^(] exactly one character that is not '('
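A quick way to check these patterns is to try them against a few strings. The sketch below assumes `grep -E` (extended regular expressions, where `+` works unescaped); the helper name `matches` is made up for this illustration and is not from the slides. Each pattern is wrapped in ^ and $ so that the whole string has to match.

```shell
export LC_ALL=C   # make [a-z] mean exactly the ASCII lowercase letters

# matches PATTERN STRING: succeeds if the whole STRING matches PATTERN.
# (hypothetical helper, not part of the original slides)
matches() {
    printf '%s\n' "$2" | grep -Eq -- "^($1)\$"
}

matches '[a-z]'       'q'    && echo "one lowercase letter: q"
matches '[a-z]*'      ''     && echo "zero or more: the empty string is fine"
matches '[a-z]+'      'cops' && echo "one or more: cops"
matches '[a-z]+'      ''     || echo "one or more: the empty string fails"
matches '[a-zA-Z0-9]' '7'    && echo "letter or digit: 7"
matches '[^(]'        '('    || echo "anything but '(': '(' itself fails"
```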

sed: overview
• a stream editor
• WHEN
– "search-and-replace"
– great for using regular expressions to change
something in the text
• HOW
– sed 's/regexp/replacement/g'
• 's/… = substitute
• …/g' = global replace
(otherwise will only replace first occurrence on a line!)
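To see what the g flag changes, here is a small sketch that runs the same substitution with and without it. The sample sentence and the variable names are only for illustration.

```shell
line='the cops saw the robber with the binoculars'

# Without /g: only the first "the" on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/the/a/')

# With /g: every "the" on the line is replaced.
global=$(printf '%s\n' "$line" | sed 's/the/a/g')

echo "$first_only"   # a cops saw the robber with the binoculars
echo "$global"       # a cops saw a robber with a binoculars
```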

sed: special characters
• ^ the start of a line…
except at the beginning of a character
set (e.g., [^a-z]), where it
complements the set
• $ the end of a line
• & in the replacement: the text that matched the regexp

• We'll see all of these in examples…
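As a warm-up, the following sketch tries each special character on one short line. The sample line and the variable names are assumptions made for this illustration.

```shell
export LC_ALL=C   # make [a-z] mean exactly the ASCII lowercase letters
line='The cops saw the robber'

# ^ anchors at the start of the line, $ at the end.
start=$(printf '%s\n' "$line" | sed 's/^/> /')
end=$(printf '%s\n' "$line" | sed 's/$/ !/')

# Inside a character set, a leading ^ complements the set:
# delete every character that is not a lowercase letter or a space.
lower_only=$(printf '%s\n' "$line" | sed 's/[^a-z ]//g')

# & in the replacement stands for whatever the regexp matched.
bracketed=$(printf '%s\n' "$line" | sed 's/[a-z][a-z]*/[&]/g')

echo "$start"        # > The cops saw the robber
echo "$end"          # The cops saw the robber !
echo "$lower_only"   # he cops saw the robber
echo "$bracketed"    # T[he] [cops] [saw] [the] [robber]
```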

sed: (simple) examples

• eg.txt =
The cops saw the robber with the binoculars
• sed 's/robber/thief/g' eg.txt
• The cops saw the thief with the binoculars
• sed 's/^/She said, "/g' eg.txt
• She said, "The cops saw the robber with the binoculars
• sed 's/^/She said, "/g' eg.txt | sed 's/$/"/g'
• She said, "The cops saw the robber with the binoculars"

sed: syntax examples (from NLP)
• eg2.txt = (a single line in the file; wrapped here)
(TOP (NP (DT The) (NNS cops)) (VP (VBD saw) (NP (DT the)
(NN robber)) (PP (IN with) (NP (DT the) (NNS binoculars)))))
• "remove the syntactic labels"
hint!: all of (and only) the syntactic labels start with '('
• cat eg2.txt | sed 's/([^ ]* //g' | sed 's/)//g'
• The cops saw the robber with the binoculars
• "now add explicit start & stop sentence symbols
(<s> and </s>, respectively)"
• cat eg2.txt | sed 's/([^ ]* //g' | sed 's/)//g' |
sed 's/^/<s> /g' | sed 's/$/ <\/s>/g'
• <s> The cops saw the robber with the binoculars </s>

sed: (more complicated) example

• eg2.txt = (a single line in the file; wrapped here)
(TOP (NP (DT The) (NNS cops)) (VP (VBD saw) (NP (DT the)
(NN robber)) (PP (IN with) (NP (DT the) (NNS binoculars)))))
• "show just the POS-and-word pairs: e.g., (POS word)"
• cat eg2.txt | sed 's/([^ ]* [^(]/~&/g' |
sed 's/[^)~]*~/ /g' |
sed 's/^ *//g' |
sed 's/))*/)/g'
• (DT The) (NNS cops) (VBD saw) (DT the) (NN robber) (IN with)
(DT the) (NNS binoculars)
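The four-stage pipeline is easier to follow one stage at a time. The sketch below keeps the same sed commands but stores the intermediate results; the variable names (tree, step1, step2, pairs) are made up for this illustration, and the tree is passed in as a string rather than read from eg2.txt.

```shell
tree='(TOP (NP (DT The) (NNS cops)) (VP (VBD saw) (NP (DT the) (NN robber)) (PP (IN with) (NP (DT the) (NNS binoculars)))))'

# Step 1: put a ~ in front of every "(LABEL x" where x is not "(",
# that is, in front of every (POS word) pair.
step1=$(printf '%s\n' "$tree" | sed 's/([^ ]* [^(]/~&/g')
echo "$step1"
# (TOP (NP ~(DT The) ~(NNS cops)) (VP ~(VBD saw) (NP ~(DT the) ...

# Step 2: replace each ~, together with the run of non-")" characters
# right before it (the phrase labels), by a single space.
step2=$(printf '%s\n' "$step1" | sed 's/[^)~]*~/ /g')
echo "$step2"
#  (DT The) (NNS cops)) (VBD saw) (DT the) (NN robber)) ...

# Step 3: strip the leading space. Step 4: collapse each run of ")" into one.
pairs=$(printf '%s\n' "$step2" | sed 's/^ *//' | sed 's/))*/)/g')
echo "$pairs"
# (DT The) (NNS cops) (VBD saw) (DT the) (NN robber) (IN with) (DT the) (NNS binoculars)
```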

awk: overview
• a simple programming language specifically designed
for text processing
– somewhat similar in nature to Python & Tcl
• WHEN
– using simple variables (counters, arrays, etc.)
– treating each word in a line individually
• HOW
– awk 'BEGIN {initializations}
/regexp1/ {actions1}
/regexp2/ {actions2}
END {final actions}' file.txt
(the slide marked the optional components in blue; in awk, the BEGIN block, the END block, and the /regexp/ patterns are all optional)
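Here is a small sketch of that layout with every part filled in. The three input lines, the counter name hits, and the messages are assumptions made for this illustration.

```shell
# Count the lines that mention "cow" and report the total at the end.
result=$(printf '%s\n' 'The cow jumped over the moon' \
                       'And the dish ran away with the spoon' \
                       'The cow came home' |
    awk 'BEGIN   { hits = 0 }                      # runs once, before any input
         /cow/   { hits++; print "match: " $0 }    # runs for each line matching /cow/
         /spoon/ { print "spoon on line " NR }     # a second pattern/action pair
         END     { print hits " line(s) with cow" }')
echo "$result"
# match: The cow jumped over the moon
# spoon on line 2
# match: The cow came home
# 2 line(s) with cow
```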

awk: special variables

• NF number of fields in a line
• $ the value of a field variable
• $0 the entire line
• NR current count of input lines

• We'll see all of these in examples…
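As a first look, this sketch prints the special variables for two sample lines. The input text and the output wording are only for illustration.

```shell
report=$(printf '%s\n' 'The cow jumped over the moon' \
                       'And the dish ran away with the spoon' |
    awk '{ print NR ": " NF " fields, first=" $1 ", last=" $NF }')
echo "$report"
# 1: 6 fields, first=The, last=moon
# 2: 8 fields, first=And, last=spoon
```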

awk: useful constructions & examples
• each word in a line is a 'field'
$1, $2, …, $NF
imagine every line of text as a row in a table; one
word per column. $1 will be the word in the first
column, $2 the next column, and so on up through
$NF (the last word on the line)
• eg3.txt =
The cow jumped over the moon
• awk '{print $2}' eg3.txt
• cow
• cat eg3.txt | awk '{$NF="up"; print $0; \
v="hello"; print v;}' -
• The cow jumped over the up
hello

awk: useful constructions & examples

• eg3.txt =
The cow jumped over the moon
• if statements
– awk '{if ($1 == "he") { print $0; }}' eg3.txt
– (empty)
– awk '{if ($1 ~ "he") { print $0; } else { … }}' eg3.txt
– The cow jumped over the moon
• for loops
– awk '{for (j=1; j <= NF; j++) { print $j }}' eg3.txt
– The
cow
jumped
over
the
moon
– what if I only wanted to print every other word
(each on a new line), in reverse order?
– awk '{for (j=NF; j > 0; j-=2) { print $j }}' eg3.txt
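The two points on this slide can be checked with a short sketch: == compares whole strings while ~ is a regexp match (so "he" matches inside "The"), and the reverse loop starts at NF and steps down by 2. The variable names kind and words are made up for this illustration.

```shell
line='The cow jumped over the moon'

# == needs the whole field to equal "he"; ~ only needs "he" to occur in it.
kind=$(printf '%s\n' "$line" |
    awk '{ if ($1 == "he") print "equal"; else if ($1 ~ "he") print "contains" }')
echo "$kind"     # contains

# Every other word, last word first.
words=$(printf '%s\n' "$line" |
    awk '{ for (j = NF; j > 0; j -= 2) { print $j } }')
echo "$words"
# moon
# over
# cow
```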

awk: useful constructions & examples
• eg4.txt =
The cow jumped over the moon
And the dish ran away with the spoon
• printf statements
– awk '{for (j=1; j <= NF; j++) { \
printf("%d\t%s\n",j,$j);}}' eg4.txt
– 1 The
2 cow
3 jumped
4 over
5 the
6 moon
1 And
2 the
…
– what if I want continuous numbering?
– awk 'BEGIN {idx=0;} {for (j=1; j <= NF; j++) { \
printf("%d\t%s\n",idx,$j); idx++;}}' eg4.txt
• substrings
– substr(<string>, <start>, <length>)
– awk '{for (j=1; j <= NF; j+=2) { \
printf("%s ",substr($j,1,3))}; print "";}' eg4.txt
– The jum the
And dis awa the
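Two details are worth trying by hand. In awk, the third argument of substr is a length (how many characters to take), not an end position, and positions start at 1. A counter set in BEGIN keeps its value across lines, which is what gives continuous numbering. In the sketch below the sample words are made up, and the counter starts at 1 rather than 0, purely as a variation.

```shell
# substr(s, start, length): take 4 characters starting at position 3.
piece=$(printf '%s\n' 'binoculars' | awk '{ print substr($1, 3, 4) }')
echo "$piece"    # nocu

# idx is initialized once and keeps counting from line to line.
numbered=$(printf '%s\n' 'The cow' 'And the dish' |
    awk 'BEGIN { idx = 1 }
         { for (j = 1; j <= NF; j++) { printf("%d\t%s\n", idx, $j); idx++ } }')
echo "$numbered"
# 1	The
# 2	cow
# 3	And
# 4	the
# 5	dish
```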

awk: from the homework

0 1 a
1 1 b
1 2 c
2 3 d
3 3 d
3 4 e
4 14

awk: from the homework

• Let’s try it!!

Python: overview
• a simple scripting language
– somewhat similar in nature to awk & Tcl
• WHEN
– more than simple reg expressions
– more than one-liners
• HOW
– not discussed here…
– …but very easy language to play with

bash: overview
• shell script
• WHEN
– repetitively applying the same commands to many
different files
– automate common tasks
• HOW
– on the command line
– in a file (type `which bash' to find your location):
#!/usr/bin/bash
<commands…>

bash: examples
• for f in *.txt; do
  echo "$f";
  tail -n 1 "$f" >> txt.tails;
done
• for (( j=0; j < 4; j++ )); do
  cat part$j.txt >> parts0-3.txt;
done
• for f in hw1.*; do
  mv "$f" "${f//hw1/hw2}";
done
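The ${f//hw1/hw2} form in the last loop is bash's pattern substitution: // replaces every occurrence, while a single / replaces only the first. The sketch below tries both on a string and then runs the renaming loop in a scratch directory; the file names and the variables tmp, base, and renamed are assumptions made for this illustration. It needs bash, not plain sh.

```shell
f='hw1.hw1-notes.txt'
all=${f//hw1/hw2}     # every occurrence replaced
first=${f/hw1/hw2}    # only the first occurrence replaced
echo "$all"           # hw2.hw2-notes.txt
echo "$first"         # hw2.hw1-notes.txt

# Try the renaming loop on throwaway files.
tmp=$(mktemp -d)
touch "$tmp/hw1.txt" "$tmp/hw1.notes"
for f in "$tmp"/hw1.*; do
    base=${f##*/}                        # file name without the directory
    mv "$f" "$tmp/${base//hw1/hw2}"
done
renamed=$(ls "$tmp")
echo "$renamed"
# hw2.notes
# hw2.txt
rm -rf "$tmp"
```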

miscellaneous
• sort
– sort -u file.txt
for a sorted list of the lines in the file, with duplicates removed
• split
– cat file.txt | split -l 20 -d - fold
divide file.txt into files of 20 lines apiece, using "fold" as the
prefix and with numeric suffixes (the lone "-" tells split to read
standard input; -d is an extension, not part of POSIX split)
• wc
– a counting utility
– wc -[l|c|w] file.txt
counts the number of lines, bytes (the same as characters for plain
ASCII text), or words in a file
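A small scratch-directory run shows the three utilities side by side. The six-line sample file and the variable names are made up for this illustration, and split is used here without -d, so the pieces get the default letter suffixes (foldaa, foldab, ...).

```shell
tmp=$(mktemp -d)
printf '%s\n' cow moon cow dish spoon moon > "$tmp/file.txt"

# sort -u: sorted, duplicates removed (cow, dish, moon, spoon).
unique=$(sort -u "$tmp/file.txt" | wc -l)

# split -l 2: pieces of 2 lines each, named foldaa, foldab, foldac.
split -l 2 "$tmp/file.txt" "$tmp/fold"
pieces=$(ls "$tmp"/fold* | wc -l)

# wc: lines and words (reading from stdin prints the number only).
lines=$(wc -l < "$tmp/file.txt")
words=$(wc -w < "$tmp/file.txt")

echo "unique=$((unique)) pieces=$((pieces)) lines=$((lines)) words=$((words))"
# unique=4 pieces=3 lines=6 words=6
rm -rf "$tmp"
```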

Putting it all together!

• Let's say I'd like to see a numbered list of all the capitalized
words that occurred in a file… but I want the words all in
lowercase.
• for f in part*; do
  echo $f;
  cat $f | awk 'BEGIN {idx=0} {
    for (j=1; j <= NF; j++)
      if (substr($j,1,1) ~ "[A-Z]") {
        printf("%d\t%s\n", idx, $j);
        idx++;
      }
  }' - | tr 'A-Z' 'a-z' > ${f//part/out};
  echo ${f//part/out};
done
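To see what one pass of that loop produces, the sketch below runs the same awk and tr stages on two sample lines instead of a part* file. The input text and the variable name caps are assumptions made for this illustration.

```shell
export LC_ALL=C   # make [A-Z] mean exactly the ASCII uppercase letters

caps=$(printf '%s\n' 'The cops saw Bob' 'Then Bob ran' |
    awk 'BEGIN { idx = 0 } {
        for (j = 1; j <= NF; j++)
            if (substr($j, 1, 1) ~ "[A-Z]") {   # first character is a capital
                printf("%d\t%s\n", idx, $j)
                idx++
            }
    }' | tr 'A-Z' 'a-z')
echo "$caps"
# 0	the
# 1	bob
# 2	then
# 3	bob
```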

Putting it all together!
• Now I'd like to see that same list, but only see each word once
(unique).
• hint: you can tell 'sort' which fields to sort on
• e.g., sort -k 4,4 sorts on the 4th field only. The older syntax
sort +3 -4 (skip the first 3 fields and stop the sort at the end of
field 4) means the same thing, but it is obsolete and current
versions of sort may reject it
• for f in out*; do
  cat $f | sort -k 2,2 -u > ${f//out/unique};
done
• and if I wanted to re-number the unique lists?
• for f in out*; do
  cat $f | sort -k 2,2 -u | awk 'BEGIN {idx=0}
    {$1=idx; print $0; idx++}' > ${f//out/unique};
done
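The sort-then-renumber step can be tried on a tiny list without any files. In the sketch below the five-entry list stands in for an out* file and the variable names are made up. Note that assigning to $1 makes awk rebuild the line with single spaces, so the tab between the number and the word becomes a space.

```shell
export LC_ALL=C   # plain byte-order sorting, so the result is predictable

# A small numbered list like the ones in the out* files: index, tab, word.
list=$(printf '%d\t%s\n' 0 the 1 cops 2 the 3 robber 4 cops)

renumbered=$(printf '%s\n' "$list" |
    sort -k 2,2 -u |                                  # one line per distinct word
    awk 'BEGIN { idx = 0 } { $1 = idx; print $0; idx++ }')
echo "$renumbered"
# 0 cops
# 1 robber
# 2 the
```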

Resources
• You can always look at the man page for help
on any of these tools!
– e.g., `man sed' or `man tail'
• My favorite online resources:
– sed: www.grymoire.com/Unix/Sed.html
– awk: www.vectorsite.net/tsawk.html
– bash: www.tldp.org/LDP/abs/html/
(particularly section 9.2 on string manipulation)
• Google it. ☺
• OpenFST tutorial
– www.cslu.ogi.edu/~hollingk/JHU_tutorial.html

Warning!
• These tools are meant for very simple text-processing applications!
– Python is the exception…
• Don't abuse them by trying to implement
computationally-intensive programs with them
– like Viterbi search and chart parsing
• Use a more suitable language like
C, C++, (Python), or Java
