Saturday, April 30, 2016

libclang: how to get token semantics

libclang defines only 5 types of tokens:

  • CXToken_Punctuation
  • CXToken_Keyword
  • CXToken_Identifier
  • CXToken_Literal
  • CXToken_Comment

Is it possible to get more detailed information about tokens? For example, for the following source code:

struct Type;
void foo(Type param);

I would expect the output to be like:

  • struct - keyword
  • Type - type name
  • ; - punctuation
  • void - type/keyword
  • foo - function name
  • ( - punctuation
  • Type - type of the function parameter
  • param - function parameter name
  • ) - punctuation
  • ; - punctuation

I also need to map those entities to file locations.

2 Answers

Answers 1

First, you probably need a bit of background on how parsing works; a textbook on compilers would be a useful resource. The file is first converted into a series of tokens: identifiers, punctuation, and so on. The code that does this is called a lexer. The parser then runs and converts that list of tokens into an AST (structured declarations, expressions, etc.).
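To make the lexer step concrete, here is a toy sketch in Python. It is not clang's lexer: the keyword set, the regular expression, and the function name `lex` are all made up for illustration, and it only handles the kinds of tokens that appear in the question's example.

```python
import re

# Tiny illustrative subset, not the full C++ keyword list.
KEYWORDS = {"struct", "void"}

# Skip leading whitespace, then match either a word or a single punctuation mark.
TOKEN_RE = re.compile(r"\s*(?:(?P<word>[A-Za-z_]\w*)|(?P<punct>[;(),{}]))")


def lex(text):
    """Split text into (kind, spelling) pairs, as a lexer would."""
    tokens = []
    text = text.rstrip()
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError("unexpected character at offset %d" % pos)
        word = match.group("word")
        if word is not None:
            kind = "keyword" if word in KEYWORDS else "identifier"
            tokens.append((kind, word))
        else:
            tokens.append(("punctuation", match.group("punct")))
        pos = match.end()
    return tokens


for kind, spelling in lex("struct Type; void foo(Type param);"):
    print(kind, spelling)
```

Note that `Type`, `foo` and `param` all come out as plain identifiers. Telling a type name from a function name takes the parser, which is why the token kinds alone cannot answer the question.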

clang does keep track of the various parts of declarations and expressions, but not in the way you're describing. For a given function declaration, it keeps track of things like the location of the name of the function and the start of the parameter list, but it keeps those in terms of locations in the file, not tokens.
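As a rough sketch of what that looks like from the Python bindings, the code below walks the AST and reports the location clang recorded for each declaration. The helper names `declaration_locations` and `parse_and_list` are made up for this example, and it assumes the `clang.cindex` module and a matching libclang library are installed.

```python
def declaration_locations(root, filename):
    """Yield (kind, name, line, column) for each declaration located in `filename`."""
    for cursor in root.walk_preorder():
        if not cursor.kind.is_declaration():
            continue
        location = cursor.location
        # Skip cursors with no file (e.g. builtins) or from other files.
        if location.file is None or location.file.name != filename:
            continue
        yield (cursor.kind.name, cursor.spelling, location.line, location.column)


def parse_and_list(source, filename="tmp.cpp"):
    """Parse `source` as an unsaved file and list its declarations."""
    # Imported lazily so the helper above can be used without libclang.
    import clang.cindex

    index = clang.cindex.Index.create()
    tu = index.parse(filename, args=["-std=c++11"],
                     unsaved_files=[(filename, source)])
    return list(declaration_locations(tu.cursor, filename))
```

Calling `parse_and_list("struct Type;\nvoid foo(Type param);\n")` should list the struct, the function and the parameter along with their locations. For a declaration, `cursor.location` points at the declared name, and `cursor.extent` gives the full source range if you need it.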

A CXToken is just a token; there isn't any additional associated semantic information beyond the five types you listed. (You can get the actual text of the token with clang_getTokenSpelling, its start location with clang_getTokenLocation, and its full source range with clang_getTokenExtent.) clang_annotateTokens maps each token to a CXCursor, which lets you examine the declaration or reference the token belongs to.
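A minimal sketch of combining the two, using the Python bindings: `token.cursor` goes through clang_annotateTokens, so each token can be paired with the kind of cursor that covers it. The `describe` function and its label table are hypothetical, written only to resemble the output asked for in the question. The cursor kind reported for a given token can vary between libclang versions, so treat the labels as a starting point.

```python
# Hypothetical mapping from cursor kind names to human-readable labels.
LABELS = {
    "STRUCT_DECL": "type name",
    "FUNCTION_DECL": "function name",
    "PARM_DECL": "function parameter name",
    "TYPE_REF": "type reference",
}


def describe(token_kind, cursor_kind):
    """Combine a lexical token kind with the kind of the cursor covering it."""
    if token_kind != "IDENTIFIER":
        # Keywords, punctuation, literals and comments keep their lexical kind.
        return token_kind.lower()
    return LABELS.get(cursor_kind, "identifier")


def annotate(source, filename="tmp.cpp"):
    """Yield (spelling, label, line, column) for every token in `source`."""
    # Imported lazily so describe() stays usable without libclang installed.
    import clang.cindex

    index = clang.cindex.Index.create()
    tu = index.parse(filename, args=["-std=c++11"],
                     unsaved_files=[(filename, source)], options=0)
    for token in tu.get_tokens(extent=tu.cursor.extent):
        location = token.location
        label = describe(token.kind.name, token.cursor.kind.name)
        yield (token.spelling, label, location.line, location.column)
```

Iterating over `annotate("struct Type;\nvoid foo(Type param);\n")` gives one tuple per token, which also covers the requirement to map each entity to a file location.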

Note that some details aren't exposed by the libclang API; if you need more detail, you might need to use clang's C++ API instead.

Answers 2

You're looking for the token spelling and location attributes exposed by libclang. In the C API these can be retrieved using the functions clang_getTokenLocation and clang_getTokenSpelling. A minimal use of these functions (using their Python equivalents) would be:

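import clang.cindex

s = '''
struct Type;
void foo(Type param);
'''

idx = clang.cindex.Index.create()
# Parse the string as an in-memory file named tmp.cpp.
tu = idx.parse('tmp.cpp', args=['-std=c++11'],
               unsaved_files=[('tmp.cpp', s)], options=0)
# Print the kind, text and start location of every token.
for t in tu.get_tokens(extent=tu.cursor.extent):
    print(t.kind, t.spelling, t.location)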

Gives:

TokenKind.KEYWORD struct <SourceLocation file 'tmp.cpp', line 2, column 1>
TokenKind.IDENTIFIER Type <SourceLocation file 'tmp.cpp', line 2, column 8>
TokenKind.PUNCTUATION ; <SourceLocation file 'tmp.cpp', line 2, column 12>
TokenKind.KEYWORD void <SourceLocation file 'tmp.cpp', line 3, column 1>
TokenKind.IDENTIFIER foo <SourceLocation file 'tmp.cpp', line 3, column 6>
TokenKind.PUNCTUATION ( <SourceLocation file 'tmp.cpp', line 3, column 9>
TokenKind.IDENTIFIER Type <SourceLocation file 'tmp.cpp', line 3, column 10>
TokenKind.IDENTIFIER param <SourceLocation file 'tmp.cpp', line 3, column 15>
TokenKind.PUNCTUATION ) <SourceLocation file 'tmp.cpp', line 3, column 20>
TokenKind.PUNCTUATION ; <SourceLocation file 'tmp.cpp', line 3, column 21>