How We Built CodeHawk: AI Code Review with Inline PR Comments

We built CodeHawk because every code review process has the same problem: humans get tired.

You review 5 PRs, you're sharp. By PR 20, you're skimming. By PR 40, you're approving things you'd normally flag. The mechanical stuff — null checks, error handling, SQL injection risks — slips through.

We wanted to catch that mechanical layer automatically, so humans could focus on what they're actually good at: architecture, design, intent.

This is how we built it.

The Core Idea

CodeHawk is a GitHub App that reviews pull requests using Claude. The flow is simple:

  1. Developer opens a PR
  2. GitHub webhook triggers our app
  3. We fetch the diff, send it to Claude
  4. Claude analyzes the code and returns issues
  5. We post inline comments on the PR
  6. Developer sees specific, actionable feedback immediately

The key constraint: it has to be fast. A review that takes 30 seconds is useful. One that takes 5 minutes defeats the purpose.
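That budget can be enforced in code rather than hoped for. A minimal sketch of the idea, assuming a deadline wrapper around the model call (the `withDeadline` helper and the 30-second figure are illustrative, not CodeHawk's actual implementation):

```typescript
// Race a review task against a deadline so a slow model call
// never blocks the PR: resolves with the result, or null on timeout.
async function withDeadline<T>(task: Promise<T>, ms: number): Promise<T | null> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<null>((resolve) => {
    timer = setTimeout(() => resolve(null), ms);
  });
  try {
    return await Promise.race([task, timeout]);
  } finally {
    clearTimeout(timer); // don't leave a dangling timer on the fast path
  }
}

// Usage (illustrative): skip posting rather than hold the PR hostage.
// const review = await withDeadline(claude.analyzeCode(diff), 30_000);
// if (review === null) return;
```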

Architecture

CodeHawk runs on Node.js with TypeScript and the Probot framework — GitHub's official library for building GitHub Apps.

Probot handles all the GitHub API ceremony: authenticating with private keys, validating webhooks, managing rate limits. Without it, we'd spend half our time wrestling with GitHub's auth system.

// Simplified: the core webhook handler
app.on('pull_request.opened', async (context) => {
  const { owner, repo } = context.repo();
  const pull_number = context.payload.pull_request.number;

  const diff = await context.octokit.pulls.get({
    owner,
    repo,
    pull_number,
    mediaType: { format: 'diff' }
  });

  const review = await claude.analyzeCode(diff.data);

  // Post a single review whose comments are anchored to specific
  // files and lines in the diff: that's what makes them inline.
  await context.octokit.pulls.createReview({
    owner,
    repo,
    pull_number,
    event: 'COMMENT',
    comments: review.issues.map((issue) => ({
      path: issue.path,
      line: issue.line,
      body: issue.comment
    }))
  });
});

Real code is more complex — error handling, rate limiting, config file parsing — but the idea is that simple.
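To give a flavor of that missing error handling: both the GitHub API and the model API fail transiently, so the real code wraps calls in a retry. A hedged sketch of the pattern (the helper name and retry counts are ours, not lifted from CodeHawk):

```typescript
// Retry a flaky async call with exponential backoff.
// Gives up after `attempts` tries and rethrows the last error.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 250
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Backoff doubles each attempt: 250ms, 500ms, 1000ms, ...
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i));
    }
  }
  throw lastError;
}

// Usage (illustrative):
// const review = await withRetry(() => claude.analyzeCode(diff.data));
```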

Claude: The Reviewer

The hard part isn't the GitHub integration. It's making the AI actually useful.

We use Claude Sonnet 4.6 because it's fast enough to keep reviews inside our latency budget and strong enough at code analysis to catch real bugs.

The prompt we send to Claude is surprisingly detailed:

You are a code reviewer. Analyze this diff and identify:

1. Security issues (injection, auth gaps, exposed secrets)
2. Bugs (null dereferences, logic errors, off-by-one)
3. Missing error handling (unhandled promises, uncaught exceptions)
4. Performance problems (inefficient loops, N+1 queries)
5. Style/consistency issues (per this repo's conventions)

For each issue, respond with:
- Line number
- Severity (error, warning, info)
- Clear, specific comment (don't be vague)
- Suggested fix (if applicable)

Ignore: auto-generated code, lock files, migrations, vendored code

The prompt is tuned for the kinds of bugs we want to catch. We don't ask Claude to review architecture (it can't), or to know your specific business logic (it shouldn't). We ask it to find mechanical issues that linters miss.
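The model's reply still has to become structured issues before anything can be posted. A sketch of that validation step, assuming (our assumption, not stated above) the prompt asks Claude to reply with a JSON array; malformed entries are dropped rather than crashing the review:

```typescript
interface ReviewIssue {
  path: string;
  line: number;
  severity: 'error' | 'warning' | 'info';
  comment: string;
  suggestedFix?: string;
}

// Parse the model's raw text into typed issues. Anything that
// isn't valid JSON, or doesn't match the shape, is silently dropped.
function parseIssues(raw: string): ReviewIssue[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return []; // model didn't return valid JSON; post nothing
  }
  if (!Array.isArray(parsed)) return [];
  const severities = new Set(['error', 'warning', 'info']);
  return parsed.filter((item): item is ReviewIssue =>
    typeof item === 'object' && item !== null &&
    typeof (item as any).path === 'string' &&
    Number.isInteger((item as any).line) &&
    severities.has((item as any).severity) &&
    typeof (item as any).comment === 'string'
  );
}
```

Dropping bad entries instead of throwing means one garbled issue doesn't cost the developer the rest of the review.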

What Works

Inline comments. GitHub's review API lets us post comments directly on the relevant lines. Not a generic summary — actual, actionable feedback. This is critical. A review that says "there's an issue somewhere" is useless. A review that says "line 42, null dereference" is gold.

Fast feedback. Most reviews post within 5 seconds. Developers see the feedback before they even leave the PR page. That immediate loop tightens the development cycle.

Configurable severity. Some teams want CodeHawk to catch everything. Others only care about errors. The .codehawk.yml config file lets teams tune behavior:

severity_threshold: error        # Skip warnings and info
ignore_paths:
  - "*.generated.ts"
  - "dist/**"
focus_areas:
  - security                      # Priority: security issues
  - error-handling
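Applying that config comes down to two filters: drop issues below the severity threshold, and skip ignored paths. A minimal sketch; the glob matching here is deliberately simplified (a real implementation would reach for a library like minimatch):

```typescript
interface CodeHawkConfig {
  severity_threshold: 'error' | 'warning' | 'info';
  ignore_paths: string[];
}

const RANK = { info: 0, warning: 1, error: 2 } as const;

// Convert a simple glob ("*.generated.ts", "dist/**") into a RegExp.
// Simplified: * matches within a path segment, ** matches across segments.
function globToRegExp(glob: string): RegExp {
  const escaped = glob
    .replace(/[.+^${}()|[\]\\]/g, '\\$&') // escape regex metacharacters
    .replace(/\*\*/g, '\u0000')           // placeholder for **
    .replace(/\*/g, '[^/]*')              // * stays inside one segment
    .replace(/\u0000/g, '.*');            // ** spans segments
  return new RegExp(`^${escaped}$`);
}

function shouldPost(
  issue: { path: string; severity: 'error' | 'warning' | 'info' },
  config: CodeHawkConfig
): boolean {
  if (RANK[issue.severity] < RANK[config.severity_threshold]) return false;
  return !config.ignore_paths.some((g) => globToRegExp(g).test(issue.path));
}
```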

Real-world example.

Last week, CodeHawk caught this in a production pull request:

// The PR introduced this function
async function getUser(id) {
  const user = await db.query('SELECT * FROM users WHERE id = ?', [id]);
  return user.email.toLowerCase();  // ← Bug: user could be null
}

// CodeHawk flagged:
// Line 3: Potential null dereference. 'user' may be undefined.
// Use optional chaining or check before dereferencing.

The developer fixed it:

async function getUser(id) {
  const user = await db.query('SELECT * FROM users WHERE id = ?', [id]);
  if (!user) throw new Error(`User ${id} not found`);
  return user.email.toLowerCase();
}

This bug would have made it to production without CodeHawk. It's the kind of thing that passes code review because reviewers skip over it, doesn't show up in tests because the edge case isn't covered, and then crashes at 2am on a Saturday.

What Doesn't Work

Architecture review. CodeHawk doesn't know if this feature is over-engineered. It won't catch "you shouldn't use a microservice for this." Those decisions need humans who know the system.

Business logic. The function might be syntactically perfect and still implement the wrong algorithm. CodeHawk can't read the spec.

Context. CodeHawk doesn't know the codebase's conventions, history, or constraints. It'll flag things that are intentional in your context.

Real false positives. We've tuned the prompt to minimize false alarms, but they happen. That's why developers can dismiss comments.

Honest Tradeoffs

We're not claiming CodeHawk replaces human reviewers. It augments them.

Human reviewers are good at: architecture, design, intent, cross-cutting concerns, knowledge of the codebase.

CodeHawk is good at: mechanical bugs, security patterns, missing error handling, style consistency.

The best process uses both.

Availability

CodeHawk is live and free during beta: Install on GitHub. Unlimited reviews, no credit card required.

When we exit beta, pricing will be: Free tier (3 reviews/month) and Pro ($79/month per org, unlimited reviews).

We built it because we got tired of missing bugs during code review. We hope it saves you from doing the same.

Try it on your next PR. Let us know what it catches.