 # Frequently asked questions

 ## Why codesearch?

 Software engineering is more about reading than writing code, and part
 of this process is finding the code that you should read. If you are
 working on a large project, then finding source code through
 navigation quickly becomes inefficient.

 Search engines let you find interesting code much faster than browsing
 code, in much the same way that search engines speed up finding things
 on the internet.

 ## Can you give an example?

 I had to implement SSH hashed hostkey checking on a whim recently, and
 here is how I quickly zoomed into the relevant code using
 [our public zoekt instance](http://cs.bazel.build):

 * [hash host ssh](http://cs.bazel.build/search?q=hash+host+ssh&num=50): more than 20k results in 750 files, in 3 seconds

 * [hash host r:openssh](http://cs.bazel.build/search?q=hash+host+r%3Aopenssh&num=50): 6k results in 114 files, in 20ms

 * [hash host r:openssh known_host](http://cs.bazel.build/search?q=hash+host+r%3Aopenssh+known_host&num=50): 4k results in 42 files, in 13ms

 The last query still yielded a substantial number of results, but the
 function `hash_host` that I was looking for was the third result in
 the first file.

 ## What features make a code search engine great?

 Often, you don't know exactly what you are looking for until you
 have found it. Code search is effective because you can formulate an
 approximate query, and then refine it based on the results you get. For
 this to work, you need the following features:

 * Coverage: the code that interests you should be available for searching

 * Speed: search should return useful results quickly (sub-second), so
   you can iterate on queries

 * Approximate queries: matching should be done case insensitively, on
   arbitrary substrings, so we don't have to know what we are looking
   for in advance.

 * Filtering: we can winnow down results by composing more specific queries

 * Ranking: interesting results (eg. function definitions, whole word
   matches) should be at the top.

 ## How does `zoekt` provide for these?

 * Coverage: `zoekt` comes with tools to mirror parts of common Git
   hosting sites. `cs.bazel.build` uses this to index most of the
   Google authored open source software on github.com and
   googlesource.com.

 * Speed: `zoekt` uses an index based on positional trigrams. For rare
   strings, eg. `nienhuys`, this typically yields results in ~10ms if
   the operating system caches are warm.

 * Approximate queries: `zoekt` supports substring patterns and regular
   expressions, and can do case-insensitive matching on UTF-8 text.

 * Filtering: you can filter a query by adding extra atoms (eg. `f:\.go$`
   limits to Go source code), and filter out terms with `-`, so
   `\blinus\b -torvalds` finds the Linuses other than Linus Torvalds.

 * Ranking: zoekt uses
   [ctags](https://github.com/universal-ctags/ctags) to find
   declarations, and these are boosted in the search ranking.
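
To make the filtering bullet concrete, here is a minimal Go sketch of how a file-name filter, a required term, and an excluded term might compose. The `file` and `query` types and the `exampleQuery` helper are hypothetical and are not zoekt's API. The sketch also assumes that `-` excludes whole files, and it scans content with Go's standard `regexp` package instead of using an index.

```go
package main

import (
	"fmt"
	"regexp"
)

// file is a hypothetical stand-in for an indexed document.
type file struct {
	name, content string
}

// query is a hypothetical conjunction of filters: the file name must match
// fileName (if set), every pattern in must has to match the content, and no
// pattern in mustNot may match it.
type query struct {
	fileName *regexp.Regexp
	must     []*regexp.Regexp
	mustNot  []*regexp.Regexp
}

// matches reports whether f passes all filters of q.
func (q query) matches(f file) bool {
	if q.fileName != nil && !q.fileName.MatchString(f.name) {
		return false
	}
	for _, re := range q.must {
		if !re.MatchString(f.content) {
			return false
		}
	}
	for _, re := range q.mustNot {
		if re.MatchString(f.content) {
			return false
		}
	}
	return true
}

// exampleQuery approximates `f:\.go$ \blinus\b -torvalds`, with the content
// patterns matched case-insensitively.
func exampleQuery() query {
	return query{
		fileName: regexp.MustCompile(`\.go$`),
		must:     []*regexp.Regexp{regexp.MustCompile(`(?i)\blinus\b`)},
		mustNot:  []*regexp.Regexp{regexp.MustCompile(`(?i)torvalds`)},
	}
}

func main() {
	q := exampleQuery()
	files := []file{
		{"authors.go", "// Linus Pauling"},
		{"kernel.go", "// Linus Torvalds"},
		{"authors.txt", "Linus Pauling"},
	}
	for _, f := range files {
		fmt.Printf("%s: %v\n", f.name, q.matches(f))
	}
}
```

Each extra atom can only shrink the result set, which is what makes it practical to start broad and narrow down step by step.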


 ## How does this compare to `grep -r`?

 Grep lets you find arbitrary substrings, but it doesn't scale to large
 corpuses, and lacks filtering and ranking.

 ## What about my IDE?

 If your project fits into your IDE, then that is great.
 Unfortunately, loading projects into IDEs is slow, cumbersome, and not
 supported by all projects.

 ## What about the search on `github.com`?

 GitHub's search has great coverage, but unfortunately, its search
 functionality doesn't support arbitrary substrings. For example, a
 query [for part of my
 surname](https://github.com/search?utf8=%E2%9C%93&q=nienhuy&type=Code)
 does not turn up anything (except this document), while
 [my complete
 surname](https://github.com/search?utf8=%E2%9C%93&q=nienhuys&type=Code)
 does.

 ## What about Etsy/Hound?

 [Etsy/hound](https://github.com/etsy/hound) is a code search engine
 which supports regular expressions over large corpuses. However, it is
 about 10x slower than zoekt, there is only rudimentary support for
 filtering, and there is no symbol ranking.

 ## What about livegrep?

 [livegrep](https://livegrep.com) is a code search engine which
 supports regular expressions over large corpuses. However, due to its
 indexing technique, it requires a lot of RAM and CPU.  There is only
 rudimentary support for filtering, and there is no symbol ranking.

 ## How many resources does `zoekt` require?

 The search server should have local SSD to store the index file (which
 is 3.5x the corpus size), and have at least 20% more RAM than the
 corpus size.
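
As a back-of-the-envelope aid, the two ratios above can be turned into a small calculation. This is only a sketch: the `estimate` function and the 10 GiB corpus are made up for illustration, and the ratios are the rules of thumb quoted above, not guarantees.

```go
package main

import "fmt"

// estimate returns rule-of-thumb sizes in bytes for a corpus of the given
// size: the index is about 3.5x the corpus, and the server should have at
// least 20% more RAM than the corpus. Integer arithmetic avoids rounding
// surprises.
func estimate(corpusBytes int64) (indexBytes, minRAMBytes int64) {
	indexBytes = corpusBytes * 7 / 2  // 3.5x the corpus size
	minRAMBytes = corpusBytes * 6 / 5 // 1.2x the corpus size
	return indexBytes, minRAMBytes
}

func main() {
	const gib = int64(1) << 30
	corpus := 10 * gib // hypothetical corpus size
	index, ram := estimate(corpus)
	fmt.Printf("corpus %d GiB: index ~%d GiB on SSD, RAM >= %d GiB\n",
		corpus/gib, index/gib, ram/gib)
}
```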

 ## Can I index multiple branches?

 Yes. You can index 64 branches (see also
 https://github.com/google/zoekt/issues/32). Files that are identical
 across branches take up space just once in the index.
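
One way to picture both properties (a fixed limit of 64 and deduplication across branches) is a 64-bit mask per stored file, with one bit per branch. The sketch below is an assumption made for illustration, not a description confirmed by this document. The `store` and `fileKey` types are hypothetical.

```go
package main

import (
	"fmt"
	"math/bits"
)

const maxBranches = 64 // one bit per branch in a uint64

// fileKey identifies a file version by path and content (hypothetical).
type fileKey struct {
	path, content string
}

// store keeps each distinct file version once, plus a mask of the branches
// that contain it.
type store struct {
	branches []string
	masks    map[fileKey]uint64
}

func newStore() *store {
	return &store{masks: map[fileKey]uint64{}}
}

// branchBit returns the bit assigned to a branch, assigning a new one if the
// branch has not been seen and there is still room.
func (s *store) branchBit(name string) (uint64, error) {
	for i, b := range s.branches {
		if b == name {
			return uint64(1) << uint(i), nil
		}
	}
	if len(s.branches) == maxBranches {
		return 0, fmt.Errorf("cannot index more than %d branches", maxBranches)
	}
	s.branches = append(s.branches, name)
	return uint64(1) << uint(len(s.branches)-1), nil
}

// add records that the given file version exists on the given branch.
func (s *store) add(branch, path, content string) error {
	bit, err := s.branchBit(branch)
	if err != nil {
		return err
	}
	s.masks[fileKey{path, content}] |= bit
	return nil
}

func main() {
	s := newStore()
	entries := [][3]string{
		{"master", "README.md", "hello"},
		{"stable", "README.md", "hello"}, // identical: stored once
		{"master", "main.go", "v2"},
		{"stable", "main.go", "v1"},
	}
	for _, e := range entries {
		if err := s.add(e[0], e[1], e[2]); err != nil {
			fmt.Println("error:", err)
			return
		}
	}
	for key, mask := range s.masks {
		fmt.Printf("%s (%q): on %d branch(es)\n",
			key.path, key.content, bits.OnesCount64(mask))
	}
}
```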

 ## How fast is the search?

 Rare strings are extremely fast to retrieve. For example,
 `r:torvalds crazy` (search for "crazy" in the Linux kernel) typically takes
 [about 7-10ms on
 cs.bazel.build](http://cs.bazel.build/search?q=r%3Atorvalds+crazy&num=70).

 The speed for common strings is dominated by how many results you want
 to see. For example, `r:torvalds license` can give some results
 quickly, but producing [all 86k
 results](http://cs.bazel.build/search?q=r%3Atorvalds+license&num=50000)
 takes between 100ms and 1 second. Streaming the results to your
 browser and rendering the HTML then takes several seconds.

 ## How fast is the indexer?

 The Linux kernel (55K files, 545M data) takes about 160s to index on
 my x250 laptop using a single thread.  The process can be parallelized
 for speedup.

 ## What does [cs.bazel.build](https://cs.bazel.build/) run on?

 Currently, it runs on a single Google Cloud VM with 16 vCPUs, 60G RAM and an
 attached physical SSD.

 ## How does `zoekt` work?

 In short, it splits each file into trigrams (groups of 3 Unicode
 characters), and stores the offset of each occurrence. Substrings are
 found by searching for different trigrams from the query at the correct
 distance apart.
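
The idea can be sketched in a few lines of Go. This is a toy, in-memory illustration under simplifying assumptions, not zoekt's actual data structures. The `index` and `posting` types are hypothetical, matching is case-sensitive, and only the first and last trigram of the query are used before the candidate is verified against the text.

```go
package main

import "fmt"

// posting records where a trigram starts: which document, at which rune offset.
type posting struct {
	doc    int
	offset int
}

// index maps each trigram to all positions where it occurs.
type index struct {
	docs     [][]rune
	postings map[string][]posting
}

func newIndex() *index {
	return &index{postings: map[string][]posting{}}
}

// add splits content into overlapping trigrams and stores each offset.
func (ix *index) add(content string) int {
	runes := []rune(content)
	id := len(ix.docs)
	ix.docs = append(ix.docs, runes)
	for i := 0; i+3 <= len(runes); i++ {
		tri := string(runes[i : i+3])
		ix.postings[tri] = append(ix.postings[tri], posting{id, i})
	}
	return id
}

// search looks up the first and last trigram of the query and keeps the
// positions where the two occur at the right distance apart, then checks
// the candidate against the stored text.
func (ix *index) search(query string) []posting {
	q := []rune(query)
	if len(q) < 3 {
		return nil // shorter queries need a different strategy
	}
	first := string(q[:3])
	last := string(q[len(q)-3:])
	dist := len(q) - 3

	lastAt := map[posting]bool{}
	for _, p := range ix.postings[last] {
		lastAt[p] = true
	}

	var hits []posting
	for _, p := range ix.postings[first] {
		if !lastAt[posting{p.doc, p.offset + dist}] {
			continue // trigrams are not the right distance apart
		}
		doc := ix.docs[p.doc]
		if string(doc[p.offset:p.offset+len(q)]) == query {
			hits = append(hits, p)
		}
	}
	return hits
}

func main() {
	ix := newIndex()
	ix.add("func hash_host(host string)")
	ix.add("the host hash is cached")
	for _, h := range ix.search("hash_host") {
		fmt.Printf("doc %d, offset %d\n", h.doc, h.offset)
	}
}
```

The second document contains both `has` and `ost`, but not six positions apart, so it is rejected without its text ever being scanned.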

 ## I want to know more

 Some further background documentation:

  * [Designdoc](design.md) for technical details
  * [Godoc](https://godoc.org/github.com/google/zoekt)
  * Gerrit 2016 user summit: [slides](https://storage.googleapis.com/gerrit-talks/summit/2016/zoekt.pdf)
  * Gerrit 2017 user summit: [transcript](https://gitenterprise.me/2017/11/01/gerrit-user-summit-zoekt-code-search-engine/),  [slides](https://storage.googleapis.com/gerrit-talks/summit/2017/Zoekt%20-%20improved%20codesearch.pdf), [video](https://www.youtube.com/watch?v=_-KTAvgJYdI)