Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

R: rewrite the parser #2452

Merged
merged 13 commits into from
May 8, 2020
Merged

R: rewrite the parser #2452

merged 13 commits into from
May 8, 2020

Conversation

masatake
Copy link
Member

@masatake masatake commented Mar 1, 2020

Close #2393.

Still considering how to handle 'a$b'.
The original parser may capture 'a$b'. However, capturing just 'b' looks better.
In that case, I would like to resolve a.
In that case, the symbol table feature developed at #2427 is needed.

Class related codes are not implemented yet.

@masatake masatake changed the title R rewrite [WIP] R: rewrite the parser Mar 1, 2020
@suntzuisafterU
Copy link

Hey, I pulled the branch from your fork and was not able to build it for some reason. I can build the ctags master branch, not sure what the difference is.

I am on OSX.

@masatake
Copy link
Member Author

Do you have an error in building ctags or a trouble about git operation?
If what you got is the former one, could you show me the error message?

@suntzuisafterU
Copy link

It's a build issue. I couldn't find the documentation for how to run it so I just ran autoconfig then make. It gave an error about no symbols for x86_64. Sorry I'm afk at the moment.

So I can build the master branch of ctags but can't remember how it was configured.

@masatake
Copy link
Member Author

masatake commented Mar 10, 2020

gcc, gnu make, automake, autoconf, and pkg-config must be installed.

yamato@slave]/tmp% cd /tmp
[yamato@slave]/tmp% git clone https://github.com/universal-ctags/ctags.git
Cloning into 'ctags'...
remote: Enumerating objects: 41, done.
remote: Counting objects: 100% (41/41), done.
remote: Compressing objects: 100% (35/35), done.
remote: Total 45732 (delta 18), reused 18 (delta 4), pack-reused 45691
Receiving objects: 100% (45732/45732), 14.25 MiB | 5.26 MiB/s, done.
Resolving deltas: 100% (29044/29044), done.
[yamato@slave]/tmp% git checkout -b masatake-R-rewrite master
fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
[yamato@slave]/tmp% cd ctags
[yamato@slave]/tmp/ctags% git checkout -b masatake-R-rewrite master
Switched to a new branch 'masatake-R-rewrite'
[yamato@slave]/tmp/ctags% git pull https://github.com/masatake/ctags.git R-rewrite
remote: Enumerating objects: 6, done.
remote: Counting objects: 100% (6/6), done.
remote: Total 21 (delta 6), reused 6 (delta 6), pack-reused 15
Unpacking objects: 100% (21/21), done.
From https://github.com/masatake/ctags
 * branch              R-rewrite  -> FETCH_HEAD
hint: Waiting for your editor to close the file...
Merge branch 'R-rewrite' of https://github.com/masatake/ctags into masatake-R-rewrite
...
Merge made by the 'recursive' strategy.
 Units/parser-r.r/r-scope.d/args.ctags                  |   3 +
 Units/parser-r.r/r-scope.d/input-2.r                   |  32 ++
 Units/parser-r.r/r-scope.d/input-3.r                   |   8 +
 Units/parser-r.r/r-scope.d/input-4.r                   |   5 +
 Units/parser-r.r/r-scope.d/input-5.r                   |   3 +
 Units/parser-r.r/r-scope.d/input.r                     |  29 ++
 Units/parser-r.r/r-simple.d/expected.tags              |   2 +-
 Units/parser-r.r/r-uppercase-extension.d/expected.tags |   2 +-
 parsers/r.c                                            | 919 +++++++++++++++++++++++++++++------
 9 files changed, 859 insertions(+), 144 deletions(-)
 create mode 100644 Units/parser-r.r/r-scope.d/args.ctags
 create mode 100644 Units/parser-r.r/r-scope.d/input-2.r
 create mode 100644 Units/parser-r.r/r-scope.d/input-3.r
 create mode 100644 Units/parser-r.r/r-scope.d/input-4.r
 create mode 100644 Units/parser-r.r/r-scope.d/input-5.r
 create mode 100644 Units/parser-r.r/r-scope.d/input.r
[yamato@slave]/tmp% bash autogen.sh; ./configure; make
[yamato@slave]/tmp% ./ctags -o - input.r

@suntzuisafterU
Copy link

I have verified all dependencies and followed your instructions. I had the same error. Here is the output:

/Library/Developer/CommandLineTools/usr/bin/ranlib: file: libctags.a(libctags_a-trace.o) has no symbols
/Library/Developer/CommandLineTools/usr/bin/ranlib: file: libctags.a(libctags_a-cxx_debug.o) has no symbols
/Library/Developer/CommandLineTools/usr/bin/ranlib: file: libctags.a(libctags_a-cxx_debug_type.o) has no symbols
/Library/Developer/CommandLineTools/usr/bin/ranlib: file: libctags.a(libctags_a-debug.o) has no symbols
/Library/Developer/CommandLineTools/usr/bin/ranlib: file: libctags.a(libctags_a-trace.o) has no symbols
/Library/Developer/CommandLineTools/usr/bin/ranlib: file: libctags.a(libctags_a-cxx_debug.o) has no symbols
/Library/Developer/CommandLineTools/usr/bin/ranlib: file: libctags.a(libctags_a-cxx_debug_type.o) has no symbols
/Library/Developer/CommandLineTools/usr/bin/ranlib: file: libctags.a(libctags_a-debug.o) has no symbols
  CCLD     ctags
Undefined symbols for architecture x86_64:
  "_R_TRACE_ENTER", referenced from:
      _parseStatement in libctags.a(libctags_a-r.o)
      _parseRightSide in libctags.a(libctags_a-r.o)
  "_R_TRACE_LEAVE", referenced from:
      _parseStatement in libctags.a(libctags_a-r.o)
      _parseRightSide in libctags.a(libctags_a-r.o)
  "_R_TRACE_TOKEN", referenced from:
      _findRTags in libctags.a(libctags_a-r.o)
  "_R_TRACE_TOKEN_TEXT", referenced from:
      _parseStatement in libctags.a(libctags_a-r.o)
      _parseRightSide in libctags.a(libctags_a-r.o)
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[1]: *** [ctags] Error 1
make: *** [all] Error 2

@masatake
Copy link
Member Author

Very strange. The debug mode is partially enabled . I cannot imge the reason.

I would like you to try both:

$ cd ctags
$ make clean; ./configure --enable-debugging;make 

and

$ cd ctags
$ make clean; ./configure --disable-debugging;make 

@suntzuisafterU
Copy link

Just built with the debugging enabled flag successfully. Strange

@suntzuisafterU
Copy link

suntzuisafterU commented Mar 10, 2020

Found one missed variable.
I modified the file below as follows to show that a survives in the scope of f

 cat Units/parser-r.r/r-scope.d/input-3.r
f <- function () {
        d <- (
          1 + 2 + 3 + (a
        <- 4) + 5)

        print(a)


}

f()
./ctags -o - Units/parser-r.r/r-scope.d/input-3.r
d       Units/parser-r.r/r-scope.d/input-3.r    /^      d <- ($/;"      v       function:f
f       Units/parser-r.r/r-scope.d/input-3.r    /^f <- function () {$/;"        f

Continuing to review

@suntzuisafterU
Copy link

From the r-extended.d directory, we miss the library attachment and sourcing of a local file. I am not sure how ctags generally handles this.

./ctags -o - Units/parser-r.r/r-extended.d/input.r
.First  Units/parser-r.r/r-extended.d/input.r   /^.First <- function() {$/;"    f
r-extended.d$ cat input.r
# From:
# https://cran.r-project.org/doc/manuals/r-release/R-intro.html#Customizing-the-environment
.First <- function() {
  options(prompt="$ ", continue="+\t")  # $ is the prompt
  options(digits=5, length=999)         # custom numbers and printout
  x11()                                 # for graphics
  par(pch = "+")                        # plotting character
  source(file.path(Sys.getenv("HOME"), "R", "mystuff.R"))
                                        # my personal functions
  library(MASS)                         # attach a package
}


(base)  (masatake-R-rewrite)
r-extended.d$ cat expected.tags
.First	input.r	/^.First <- function() {$/;"	f
MASS	input.r	/^  library(MASS)                         # attach a package$/;"	l
file.path(Sys.getenv("HOME"	input.r	/^  source(file.path(Sys.getenv("HOME"), "R", "mystuff.R"))$/;"	s

@suntzuisafterU
Copy link

I also noticed in the file r-scoped/input.r that the following line:

f5 <- function(a5) c(1,
...

Defines a5 as an argument to f5, and this is not picked up in the output:

./ctags -o - Units/parser-r.r/r-scope.d/input.r
c00     Units/parser-r.r/r-scope.d/input.r      /^c00 = 1$/;"   g
c01     Units/parser-r.r/r-scope.d/input.r      /^c01 <- 2$/;"  g
c02     Units/parser-r.r/r-scope.d/input.r      /^c02 <<- 3$/;" g
c50     Units/parser-r.r/r-scope.d/input.r      /^f5((c50 = 4))$/;"     g
c51     Units/parser-r.r/r-scope.d/input.r      /^f5(c51 <- 4)$/;"      g
c52     Units/parser-r.r/r-scope.d/input.r      /^f5(c52 <<- 4)$/;"     g
f0      Units/parser-r.r/r-scope.d/input.r      /^f0 <- function() {$/;"        f
f1      Units/parser-r.r/r-scope.d/input.r      /^f1 <- function(a1) {$/;"      f
f2      Units/parser-r.r/r-scope.d/input.r      /^f2 <- function(a20, a21) {$/;"        f
f4      Units/parser-r.r/r-scope.d/input.r      /^f4 <- function() c02 +$/;"    f
f5      Units/parser-r.r/r-scope.d/input.r      /^f5 <- function(a5) c(1,$/;"   f
r0      Units/parser-r.r/r-scope.d/input.r      /^   r0 <- c00 + c01 + c02$/;"  v       function:f0
r1      Units/parser-r.r/r-scope.d/input.r      /^   r1 = c00 + c01 + c02 + a1$/;"      v       function:f1
r2      Units/parser-r.r/r-scope.d/input.r      /^   r2 <<- c00 + c01 + c02 + a20 + a21$/;"     v       function:f2
r5      Units/parser-r.r/r-scope.d/input.r      /^r5 <- 3,$/;"  v       function:f5

@masatake
Copy link
Member Author

Just built with the debugging enabled flag successfully. Strange
I found its reason. I will push the fix for it.

@suntzuisafterU
Copy link

I have reviewed all the provided test cases, I have one more observation; when

r1 = 0
f <- function() {
  r1 <<- r1 + 1
}

show(f())
show(f())


r1 = 37

g <- function() {
  r1 <- r1 + 1
}

show(g())
show(g())

Is executed, the output is:

1
2
38
38

Because the <<- operator escapes and updates the outer scope variable, instead of making a new one.

Sorry for the delay in getting back to you, hopefully this helps.

@masatake
Copy link
Member Author

@suntzuisafterU, thank you very much.
source and library are not handled yet. handling them is not hard. I can implement the code for them later.

I must fix the other items you spotted first.

masatake added a commit to masatake/ctags that referenced this pull request Mar 11, 2020
In `(' and `)', newline between 'f' and '<-' should be ignored.

    $ cat Units/parser-r.r/r-scope.d/input-3.r
    f <- function () {
	    d <- (
	      1 + 2 + 3 + (a
	    <- 4) + 5)

	    print(a)
    }

In this input, "a" must be captured though "<-" is
not at the same line.
The original code doesn't ignore the newline.

The issue is suggested by @suntzuisafterU in universal-ctags#2452.

Signed-off-by: Masatake YAMATO <[email protected]>
@masatake
Copy link
Member Author

 cat Units/parser-r.r/r-scope.d/input-3.r
f <- function () {
        d <- (
          1 + 2 + 3 + (a
        <- 4) + 5)

        print(a)


}

f()

./ctags -o - Units/parser-r.r/r-scope.d/input-3.r
d       Units/parser-r.r/r-scope.d/input-3.r    /^      d <- ($/;"      v       function:f
f       Units/parser-r.r/r-scope.d/input-3.r    /^f <- function () {$/;"        f

This one may be fixed in the new commits.

@masatake
Copy link
Member Author

r1 = 0
f <- function() {
  r1 <<- r1 + 1
}

show(f())
show(f())


r1 = 37

g <- function() {
  r1 <- r1 + 1
}

show(g())
show(g())

In this case, r1 on the left side of '<<-' and '<-' are not definition. They should be captured as reference tags because they are just updating the value of variable.
The definitions are at the first line 'r1 = 0'. This one introduces a new r1.

I found one derived input.

f <- function() {
  r1 <<- 1
}

show(r1)

In this case 'r1' is defined in the code f.

To make a better tags file, ctags can know whether a name is already appeared or not.
This functionality is needed ont only the R parser but also...all the parsers.
I"m working on this topic at #2413.

With the symbol table, the R parser can know whether r1 is defined already or not.

f <- function() {
  r1 <<- 1
}

In this case, r1 is not defined before parsing f. So we can say r1 is defined here.

f <- function() {
  r1 <<- 1
}
g <- function() {
  r1 <<- 2
}

ctags works in lexical level. So r1 in f and r1 in g are captured as definition tags.

What I cannot find the way how I should do is scope.

f <- function() {
  r1 <<- 1
}

In R's semantic, the scope of r1 is empty; r1 is in the top level namespace.
However, I'm not happy with this. I would like to attach the information "r is defined during running f()"
to the tag for r1. In that case how do I call the field for it? The name executionScope hits me.

f <- function () {
  g <- function () {
     r1 <<- 1
  }
}

In this case the tag entry for r1 is...

f     .... kind:function
g    ...  kind:functionVar scope:f
r1   .... kind:functionVar scope:f exectuionScope:f.g

executionScope is not special for R. So it it better to make it short: xscope.

Maybe a better term is defined in compiler theory.
I have to think more about this topic.

@masatake
Copy link
Member Author

I also noticed in the file r-scoped/input.r that the following line:

f5 <- function(a5) c(1,
...

Yes. I'm thinking about capturing a1 with another kind tag: parameter.
Adding this may be easy.

So the most important one in this situation is ... deciding how we should do with "<<-".

@suntzuisafterU
Copy link

In this example:

r1 = 0
f <- function() {
  r1 <<- r1 + 1
}

show(f())
show(f())


r1 = 37

g <- function() {
  r1 <- r1 + 1  # This defines a local variable r1, and also implicitly returns it
}

show(g())
show(g())

There are actually 2 instances of variables with the symbol r1.
The r1 value that is local to g is also implicitly returned.

Also, when a variable is defined within a function via <<-, it can redefine variables already
existing in the global scope. Then really there can be multiple statements that may be responsible for defining one variable in the global scope.

Also note that:

a0 <- "Here"

f <- function() {
  a0 <- 42
  g <- function() {
    a0 <<- 99
  }
  show(a0)
  g()
  show(a0)
}

f()

show(a0)

Prints the following:

42
99
Here

Indicating that <<- simply redefines the smallest enclosing scopes variable with the same name, or defines the variable in the global scope if it does not exist.

Handling multiple definitions in the same scope may be problematic, but semantically this
is what is happening. There may not be an easy solution.

The orignal code doesn't consider the case that a scope
is empty. As the result, foreachEntriesInScope (index, NULL,...)
raises SEGV.

Signed-off-by: Masatake YAMATO <[email protected]>
@masatake masatake changed the title [WIP] R: rewrite the parser R: rewrite the parser May 5, 2020
@coveralls
Copy link

coveralls commented May 5, 2020

Coverage Status

Coverage increased (+0.05%) to 86.627% when pulling 82036e8 on masatake:R-rewrite into 3671ad7 on universal-ctags:master.

@masatake
Copy link
Member Author

masatake commented May 5, 2020

Using newly introduced facility called "symbol table", I have refined the original pull request.

TODO

  • support file extension ".s" again,
  • improve coverage of test cases,
  • add notes about the behaviour of R to dccs/parser-r.rst.

The last one may be important when thinking about improving JavaScript parser.

anyKindsEntryInScopeRecursive does the mostly same as anyKindsEntryInScope(),
but does recursively from the given scope to the root scope upward.

Signed-off-by: Masatake YAMATO <[email protected]>
@codecov
Copy link

codecov bot commented May 5, 2020

Codecov Report

Merging #2452 into master will increase coverage by 0.04%.
The diff coverage is 86.63%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2452      +/-   ##
==========================================
+ Coverage   86.47%   86.52%   +0.04%     
==========================================
  Files         182      182              
  Lines       37850    38280     +430     
==========================================
+ Hits        32732    33122     +390     
- Misses       5118     5158      +40     
Impacted Files Coverage Δ
parsers/r.c 85.83% <85.65%> (-9.76%) ⬇️
main/entry.c 88.72% <100.00%> (+2.29%) ⬆️
main/tokeninfo.c 95.78% <100.00%> (ø)
main/rbtree.c 36.81% <0.00%> (+4.47%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3671ad7...82036e8. Read the comment docs.

@masatake masatake force-pushed the R-rewrite branch 2 times, most recently from 13f19b7 to 30628a6 Compare May 6, 2020 09:29
@masatake
Copy link
Member Author

masatake commented May 6, 2020

The coverage of the scanner is too low.

* A broken ifdef condtion for macros for debugging is spotted
  by @suntzuisafterU.

* The following issue in the original pull request is reported
  by @suntzuisafterU.

  In `(' and `)', newline between 'f' and '<-' should be ignored.

    $ cat Units/parser-r.r/r-scope.d/input-3.r
    f <- function () {
	    d <- (
	      1 + 2 + 3 + (a
	    <- 4) + 5)

	    print(a)
    }

  In this input, "a" must be captured though "<-" is
  not at the same line.
  The original code doesn't ignore the newline.

Signed-off-by: Masatake YAMATO <[email protected]>
@masatake masatake force-pushed the R-rewrite branch 2 times, most recently from 3273574 to 71bdf7d Compare May 7, 2020 10:05
They were kinds for definition tags.
Now libraries attached by library function and scripts
loaded by source functions are captured as reference tags.

New code accepts only a string literal or identifier as
a library or a script. The new code doesn't recognize
compound expressions like:

	 source(file.path(Sys.getenv("HOME"), "R", "mystuff.R"))

Signed-off-by: Masatake YAMATO <[email protected]>
R is a dynamic language; a variable is not declared explicitly.
If assigning a value to a variable and it is the first time, ctags
should capture it as a variable definition.

ctags doesn't and should not evaluate the code, so detecting the
first appearance of a variable is not easy. However, there are areas
we can improve.

* parameter

  If assigning a value to a variable listed in the function parameter
  list, it should not be tagged. Instead the parameter should be tagged.

* only the first assignment in the scope

  Only the first assignment to a variable in the scope should be tagged.
  The later assignments to the same variable in the same scope should be
  ignored.

This behavior is very different from the old R parser. The old R parser
tags all assignments.

Unlike JavaScript parser, the new R parser tags all assignments at the
top level if the values are functions.

Signed-off-by: Masatake YAMATO <[email protected]>
…ariables

    f <- function () {
      x <<- 1
    }

In this case x should be captured as globalVar kind object.

    y <- 1
    g <- function () {
      y <<- 2
    }

In this case y in g should not be captured.

Signed-off-by: Masatake YAMATO <[email protected]>
Signed-off-by: Masatake YAMATO <[email protected]>
@masatake masatake merged commit ca0412a into universal-ctags:master May 8, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

R parser does not recognize = based assignment
3 participants