Text Analyzer for Problematic Words

Note: I was using my "Artifacts" project for this, with these custom instructions:

Custom instructions

Never use React in artifacts - always plain HTML and vanilla JavaScript and CSS with minimal dependencies.

CSS should be indented with two spaces and should start like this:

<style>
* {
  box-sizing: border-box;
}

Inputs and textareas should be font size 16px. Font should always prefer Helvetica.

JavaScript should be two space indents and start like this:

<script type="module">
// code in here should not be indented at the first level

human (Dec 13, 2024, 11:30 AM)

Build an artifact I can paste code into which will display any problematic words inspired by these bash scripts

paste.txt


Weasel words

Weasel words--phrases or words that sound good without conveying information--obscure precision.

I notice three kinds of weasel words in my students' writing: (1) salt and pepper words, (2) beholder words and (3) lazy words.

Salt and pepper words

New grad students sprinkle in salt and pepper words for seasoning. These words look and feel like technical words, but convey nothing.

My favorite salt and pepper words/phrases are various, a number of, fairly, and quite. Sentences that cut these words out become stronger.

 Bad:    It is quite difficult to find untainted samples.
 Better: It is difficult to find untainted samples.

 Bad:    We used various methods to isolate four samples.
 Better: We isolated four samples.

Beholder words

Beholder words are those whose meaning is a function of the reader; for example: interestingly, surprisingly, remarkably, or clearly.

Peer reviewers don't like judgments drawn for them.

Bad:    False positives were surprisingly low.
Better: To our surprise, false positives were low.
Good:   To our surprise, false positives were low (3%).

Lazy words

Students insert lazy words in order to avoid making a quantitative characterization. They give the impression that the author has not yet conducted said characterization.

These words make the science feel unfirm and unfinished.

The two worst offenders in this category are the words very and extremely. These two adverbs are never excusable in technical writing. Never.

Other offenders include several, exceedingly, many, most, few, vast.

 Bad:    There is a very close match between the two semantics.
 Better: There is a close match between the two semantics.

Adverbs

In technical writing, adverbs tend to come off as weasel words.

I'd even go so far as to say that the removal of all adverbs from any technical writing would be a net positive for my newest graduate students. (That is, new graduate students weaken a sentence when they insert adverbs more frequently than they strengthen it.)

 Bad:    We offer a completely different formulation of CFA.
 Better: We offer a different formulation of CFA.

A script to find weasel words

With this script, you can supply an alternate list of weasel words in a file if you don't like the default:

#!/bin/bash

weasels="many|various|very|fairly|several|extremely\
|exceedingly|quite|remarkably|few|surprisingly\
|mostly|largely|huge|tiny|((are|is) a number)\
|excellent|interestingly|significantly\
|substantially|clearly|vast|relatively|completely"

wordfile=""

# Check for an alternate weasel file
if [ -f $HOME/etc/words/weasels ]; then
    wordfile="$HOME/etc/words/weasels"
fi

if [ -f $WORDSDIR/weasels ]; then
    wordfile="$WORDSDIR/weasels"
fi

if [ -f words/weasels ]; then
    wordfile="words/weasels"
fi

if [ ! "$wordfile" = "" ]; then
    weasels="xyzabc123";
    for w in `cat $wordfile`; do
        weasels="$weasels|$w"
    done
fi


if [ "$1" = "" ]; then
 echo "usage: `basename $0` <file> ..."
 exit
fi

grep -E -i -n --color "\\b($weasels)\\b" "$@"

exit $?
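
Since the artifact requested here runs in a browser, the same search can be sketched in JavaScript. This is a minimal sketch rather than a port of the script: `findWeasels` and `parseWordList` are hypothetical names, and the sketch reports one hit per match instead of printing each matching line once as grep does. It keeps the `(are|is) a number` phrase from the default list.

```javascript
// Default list copied from the bash script above.
const defaultWeasels = [
  'many', 'various', 'very', 'fairly', 'several', 'extremely',
  'exceedingly', 'quite', 'remarkably', 'few', 'surprisingly',
  'mostly', 'largely', 'huge', 'tiny', '(?:are|is) a number',
  'excellent', 'interestingly', 'significantly',
  'substantially', 'clearly', 'vast', 'relatively', 'completely'
]

// Hypothetical helper: turn the contents of a custom word file into a list,
// splitting on whitespace as the script's for loop does.
function parseWordList(contents) {
  return contents.split(/\s+/).filter(Boolean)
}

// Hypothetical helper: returns one {line, match} object per hit.
function findWeasels(text, words = defaultWeasels) {
  if (words.length === 0) return []
  // Whole-word, case-insensitive match, like grep -i with \b on both sides
  const pattern = new RegExp(`\\b(?:${words.join('|')})\\b`, 'gi')
  const hits = []
  text.split('\n').forEach((lineText, i) => {
    for (const m of lineText.matchAll(pattern)) {
      hits.push({ line: i + 1, match: m[0] })
    }
  })
  return hits
}
```

Passing `parseWordList(fileContents)` as the second argument plays the role of the alternate weasel file.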

Passive voice

There are times when the passive voice is acceptable in technical writing.

I also believe, as with adverbs, that removal of the passive voice would have been a net improvement for over half the technical writing I've edited. (That is, students abuse the passive voice more often than they use it well.)

Of course, I do not advocate dogmatic removal of the passive voice.

The passive voice is tough to shake. Even while writing this article, I caught myself defaulting to the passive in situations where the active was better.

The passive voice is bad when it hides relevant or explanatory information:

 Bad:    Termination is guaranteed on any input.
 Better: Termination is guaranteed on any input by a finite state-space.
 OK:     A finite state-space guarantees termination on any input.

In the first sentence, the passive hides relevant information.

The second sentence includes the relevant information, but the passive misplaces the emphasis.

The third sentence contains all the relevant information, and it feels crisp.

There's one case where I think the passive is preferable in technical writing--when the subject is truly irrelevant:

 OK: 4 mL HCl were added to the solution.

Even in this example, I personally don't believe it's egregious to use we:

 OK (to me): We added 4 mL HCl to the solution.

In summary, for each use of the passive highlighted by my script, ask the following questions:

    Is the agent relevant yet unclear?
    Does the text read better with the sentence in the active?

If the answer to both questions is "yes," then change to the active.

If only the answer to the first question is "yes," then specify the agent.

A script to find passive voice

#!/bin/bash

irregulars="awoken|\
been|born|beat|\
become|begun|bent|\
beset|bet|bid|\
bidden|bound|bitten|\
bled|blown|broken|\
bred|brought|broadcast|\
built|burnt|burst|\
bought|cast|caught|\
chosen|clung|come|\
cost|crept|cut|\
dealt|dug|dived|\
done|drawn|dreamt|\
driven|drunk|eaten|fallen|\
fed|felt|fought|found|\
fit|fled|flung|flown|\
forbidden|forgotten|\
foregone|forgiven|\
forsaken|frozen|\
gotten|given|gone|\
ground|grown|hung|\
heard|hidden|hit|\
held|hurt|kept|knelt|\
knit|known|laid|led|\
leapt|learnt|left|\
lent|let|lain|lighted|\
lost|made|meant|met|\
misspelt|mistaken|mown|\
overcome|overdone|overtaken|\
overthrown|paid|pled|proven|\
put|quit|read|rid|ridden|\
rung|risen|run|sawn|said|\
seen|sought|sold|sent|\
set|sewn|shaken|shaven|\
shorn|shed|shone|shod|\
shot|shown|shrunk|shut|\
sung|sunk|sat|slept|\
slain|slid|slung|slit|\
smitten|sown|spoken|sped|\
spent|spilt|spun|spit|\
split|spread|sprung|stood|\
stolen|stuck|stung|stunk|\
stridden|struck|strung|\
striven|sworn|swept|\
swollen|swum|swung|taken|\
taught|torn|told|thought|\
thrived|thrown|thrust|\
trodden|understood|upheld|\
upset|woken|worn|woven|\
wed|wept|wound|won|\
withheld|withstood|wrung|\
written"

if [ "$1" = "" ]; then
 echo "usage: `basename $0` <file> ..."
 exit
fi

grep -E -n -i --color \
 "\\b(am|are|were|being|is|been|was|be)\
\\b[ ]*(\w+ed|($irregulars))\\b" "$@"

exit $?
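
The pattern in this script, a form of "to be" followed by a word ending in "ed" or by an irregular participle, carries over to a JavaScript regular expression almost unchanged. The sketch below is illustrative only: `findPassive` is a hypothetical name, and `sampleIrregulars` is a shortened list that stands in for the full one above.

```javascript
// Shortened participle list for illustration only; the bash script has the full set.
const sampleIrregulars = [
  'been', 'built', 'done', 'found', 'given', 'known',
  'made', 'seen', 'shown', 'taken', 'written'
]

// Hypothetical helper: returns one {line, match} object per hit.
function findPassive(text, irregulars = sampleIrregulars) {
  const pattern = new RegExp(
    '\\b(?:am|are|were|being|is|been|was|be)\\b[ ]*' +
    `(?:\\w+ed|(?:${irregulars.join('|')}))\\b`,
    'gi'
  )
  const hits = []
  // Match line by line, as grep does, so a construction split
  // across a line break is not reported
  text.split('\n').forEach((lineText, i) => {
    for (const m of lineText.matchAll(pattern)) {
      hits.push({ line: i + 1, match: m[0] })
    }
  })
  return hits
}
```

Like the script, this is a heuristic: it also flags non-passive phrases such as "is indeed", so each hit still needs the two questions above.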

A script to find lexical illusions

Read the following text:

 Many readers are not aware that the
 the brain will automatically ignore
 a second instance of the word "the"
 when it starts a new line. 

Read that same text again, but with different line breaks:

 Many readers are not aware that the the
 brain will automatically ignore a second
 instance of the word "the" when it starts
 a new line. 

Duplicating words is a phenomenon of electronic composition.

Duplicates seem to arise from cut-and-paste accidents, most frequently with the word the.

Unfortunately, it can be difficult to proofread away duplicate words, because this lexical illusion prevents us from finding them.

No reviewer will shoot down a submission solely because it contains duplicate words, but when small mistakes like spelling errors and duplicate words pile up, they convey a lack of proofreading.

Reviewers will (rightfully) interpret inadequate proofreading as a lack of respect for their time and attention.

Fortunately, a short perl script hunts these bugs down:

#!/usr/bin/env perl

# Finds duplicate adjacent words.

use strict ;

my $DupCount = 0 ;

if (!@ARGV) {
  print "usage: dups <file> ...\n" ;
  exit ;
}

while (1) {
  my $FileName = shift @ARGV ;

  # Exit code = number of duplicates found.  
  exit $DupCount if (!$FileName) ;

  open FILE, $FileName or die $!; 
  
  my $LastWord = "" ;
  my $LineNum = 0 ;
  
  while (<FILE>) {
    chomp ;

    $LineNum ++ ;
    
    my @words = split (/(\W+)/) ;
    
    foreach my $word (@words) {
      # Skip spaces:
      next if $word =~ /^\s*$/ ;

      # Skip punctuation:
      if ($word =~ /^\W+$/) {
        $LastWord = "" ;
        next ;
      }
      
      # Found a dup? 
      if (lc($word) eq lc($LastWord)) {
        print "$FileName:$LineNum $word\n" ;
        $DupCount ++ ;
      } # Thanks to Sean Cronin for tip on case.

      # Mark this as the last word:
      $LastWord = $word ;
    }
  }
  
  close FILE ;
}
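
Two details of the Perl script are easy to lose in a port: the last word is remembered across line breaks, and any punctuation between two words resets it. A rough JavaScript equivalent, assuming a hypothetical `findDuplicates` helper that takes the whole text as one string, might look like this:

```javascript
// Hypothetical helper: returns one {line, word} object per adjacent duplicate.
function findDuplicates(text) {
  const dups = []
  // Declared outside the line loop so "the\nthe" is still caught
  let lastWord = ''
  text.split('\n').forEach((lineText, i) => {
    // A capturing group in split() keeps the separators in the output
    for (const token of lineText.split(/(\W+)/)) {
      // Skip spaces and the empty strings split() can produce
      if (/^\s*$/.test(token)) continue
      // Punctuation between words resets the comparison
      if (/^\W+$/.test(token)) {
        lastWord = ''
        continue
      }
      // Case-insensitive comparison with the previous word
      if (token.toLowerCase() === lastWord.toLowerCase()) {
        dups.push({ line: i + 1, word: token })
      }
      lastWord = token
    }
  })
  return dups
}
```

With the first example text above, this reports the second "the" on line 2. JavaScript's `\W` treats only ASCII letters, digits, and underscore as word characters, so accented words would need a Unicode-aware pattern.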

assistant (Dec 13, 2024, 11:31 AM)

I'll create a web-based tool that helps identify weasel words, passive voice constructions, and lexical illusions (duplicate words) in text. I'll use vanilla JavaScript and implement the core functionality from these scripts.

create Writing Style Analyzer

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Writing Style Analyzer</title>
<style>
* {
  box-sizing: border-box;
}

body {
  font-family: Helvetica, Arial, sans-serif;
  line-height: 1.6;
  margin: 0;
  padding: 20px;
  background: #f5f5f5;
}

.container {
  max-width: 800px;
  margin: 0 auto;
  background: white;
  padding: 20px;
  border-radius: 8px;
  box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);
}

h1 {
  margin-top: 0;
  color: #333;
}

textarea {
  width: 100%;
  height: 200px;
  padding: 12px;
  border: 1px solid #ddd;
  border-radius: 4px;
  margin-bottom: 20px;
  font-size: 16px;
  font-family: inherit;
}

.results {
  margin-top: 20px;
}

.category {
  margin-bottom: 20px;
  padding: 15px;
  background: #f8f9fa;
  border-radius: 4px;
}

.category h2 {
  margin-top: 0;
  color: #444;
  font-size: 1.2em;
}

.highlight {
  background: #ffd700;
  padding: 2px 4px;
  border-radius: 2px;
}

.warning {
  color: #856404;
  background-color: #fff3cd;
  border: 1px solid #ffeeba;
  padding: 10px;
  margin-bottom: 10px;
  border-radius: 4px;
}
</style>
</head>
<body>
  <div class="container">
    <h1>Writing Style Analyzer</h1>
    <p>Paste your text below to check for weasel words, passive voice, and duplicate words:</p>
    <textarea id="input" placeholder="Enter your text here..."></textarea>
    <div id="results" class="results"></div>
  </div>

<script type="module">
// Weasel words from the bash script
const weaselWords = [
  'many', 'various', 'very', 'fairly', 'several', 'extremely',
  'exceedingly', 'quite', 'remarkably', 'few', 'surprisingly',
  'mostly', 'largely', 'huge', 'tiny', 'excellent', 'interestingly',
  'significantly', 'substantially', 'clearly', 'vast', 'relatively',
  'completely'
]

// Common irregular verbs for passive voice detection
const irregularVerbs = [
  'awoken', 'been', 'born', 'beat', 'become', 'begun', 'bent',
  'bound', 'bitten', 'bled', 'blown', 'broken', 'brought',
  'built', 'burnt', 'bought', 'caught', 'chosen', 'come',
  'dealt', 'done', 'drawn', 'driven', 'eaten', 'fallen',
  'fought', 'found', 'flown', 'forgotten', 'given', 'gone',
  'grown', 'hung', 'heard', 'hidden', 'held', 'kept', 'known',
  'laid', 'led', 'left', 'lost', 'made', 'meant', 'met', 'paid',
  'put', 'read', 'run', 'said', 'seen', 'sold', 'sent', 'set',
  'shown', 'shut', 'sung', 'sat', 'slept', 'spoken', 'spent',
  'stood', 'taken', 'taught', 'told', 'thought', 'thrown',
  'understood', 'worn', 'won', 'written'
]

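function findWeaselWords(text) {
  const results = []
  // Split on whitespace so that each index lines up with getContext()
  const words = text.toLowerCase().split(/\s+/)

  words.forEach((token, index) => {
    // A token may carry punctuation or hyphens, so check each alphabetic part
    token.split(/[^a-z]+/).forEach(part => {
      if (weaselWords.includes(part)) {
        results.push({
          word: part,
          index: index,
          context: getContext(text, index)
        })
      }
    })
  })

  return results
}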

function findPassiveVoice(text) {
  const results = []
  const beVerbs = ['am', 'is', 'are', 'was', 'were', 'be', 'been', 'being']
  const words = text.toLowerCase().split(/\s+/)

  words.forEach((word, index) => {
    // Allow leading punctuation on the be-verb, e.g. "(is"
    const verb = word.replace(/^[^a-z]+/, '')
    if (beVerbs.includes(verb)) {
      // Allow trailing punctuation on the participle, e.g. "guaranteed."
      const nextWord = (words[index + 1] || '').replace(/[^a-z]+$/, '')
      if (
        nextWord.endsWith('ed') ||
        irregularVerbs.includes(nextWord)
      ) {
        results.push({
          construction: `${verb} ${nextWord}`,
          context: getContext(text, index)
        })
      }
    }
  })

  return results
}

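function findDuplicateWords(text) {
  const results = []
  const words = text.toLowerCase().split(/\s+/)

  words.forEach((word, index) => {
    // Drop trailing punctuation from the current word so "the the." is caught;
    // the previous word keeps its punctuation so "end. End" is not flagged
    const bare = word.replace(/\p{P}+$/u, '')
    // split() yields empty strings for leading or trailing whitespace; skip them
    if (index > 0 && bare !== '' && bare === words[index - 1]) {
      results.push({
        word: bare,
        context: getContext(text, index)
      })
    }
  })

  return results
}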

function getContext(text, index) {
  const words = text.split(/\s+/)
  const start = Math.max(0, index - 3)
  const end = Math.min(words.length, index + 4)
  return words.slice(start, end).join(' ')
}

function escapeHtml(value) {
  // User text is inserted with innerHTML below, so escape markup characters first
  return value
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
}

function displayResults(weasels, passives, duplicates) {
  const resultsDiv = document.getElementById('results')
  resultsDiv.innerHTML = ''

  // Weasel Words
  const weaselDiv = document.createElement('div')
  weaselDiv.className = 'category'
  weaselDiv.innerHTML = `
    <h2>Weasel Words</h2>
    ${weasels.length === 0 ? 'No weasel words found.' :
      weasels.map(w => `
        <div class="warning">
          Found "<span class="highlight">${escapeHtml(w.word)}</span>" in: "${escapeHtml(w.context)}"
        </div>
      `).join('')}
  `
  resultsDiv.appendChild(weaselDiv)

  // Passive Voice
  const passiveDiv = document.createElement('div')
  passiveDiv.className = 'category'
  passiveDiv.innerHTML = `
    <h2>Passive Voice</h2>
    ${passives.length === 0 ? 'No passive voice constructions found.' :
      passives.map(p => `
        <div class="warning">
          Found passive voice "<span class="highlight">${escapeHtml(p.construction)}</span>" in: "${escapeHtml(p.context)}"
        </div>
      `).join('')}
  `
  resultsDiv.appendChild(passiveDiv)

  // Duplicate Words
  const duplicateDiv = document.createElement('div')
  duplicateDiv.className = 'category'
  duplicateDiv.innerHTML = `
    <h2>Duplicate Words</h2>
    ${duplicates.length === 0 ? 'No duplicate words found.' :
      duplicates.map(d => `
        <div class="warning">
          Found duplicate word "<span class="highlight">${escapeHtml(d.word)}</span>" in: "${escapeHtml(d.context)}"
        </div>
      `).join('')}
  `
  resultsDiv.appendChild(duplicateDiv)
}

// Set up event listener
document.getElementById('input').addEventListener('input', (e) => {
  const text = e.target.value
  const weasels = findWeaselWords(text)
  const passives = findPassiveVoice(text)
  const duplicates = findDuplicateWords(text)
  displayResults(weasels, passives, duplicates)
})
</script>
</body>
</html>

I've created a web-based writing analyzer that checks for three main issues from the original scripts:

Weasel words (like "various", "fairly", "quite")
Passive voice constructions
Duplicate adjacent words

The tool features:

Real-time analysis as you type
Highlighted problems with surrounding context
Clean, minimal interface
Mobile-friendly design
Clear categorization of issues

You can paste any text into the textarea and it will immediately analyze it and show any potential issues. Each issue is shown with surrounding context to help you locate it in your text.

Would you like me to adjust any of the detection patterns or styling?

simonw/analyzer.md Secret


simonw commented Dec 14, 2024
